Preface



The first evaluation campaign of the Cross-Language Evaluation Forum (CLEF) for 
European languages was held from January to September 2000. The campaign culmi- 
nated in a two-day workshop in Lisbon, Portugal, 21-22 September, immediately 
following the fourth European Conference on Digital Libraries (ECDL 2000). The 
first day of the workshop was open to anyone interested in the area of Cross-Language 
Information Retrieval (CLIR) and addressed the topic of CLIR system evaluation. The 
goal was to identify the actual contribution of evaluation to system development and 
to determine what could be done in the future to stimulate progress. The second day 
was restricted to participants in the CLEF 2000 evaluation campaign and to their ex- 
periments. This volume constitutes the proceedings of the workshop and provides a 
record of the campaign. 

CLEF is currently an activity of the DELOS Network of Excellence for Digital Li- 
braries, funded by the EC Information Society Technologies to further research in 
digital library technologies. The activity is organized in collaboration with the US 
National Institute of Standards and Technology (NIST). The support of DELOS and 
NIST in the running of the evaluation campaign is gratefully acknowledged. 

I should also like to thank the other members of the Workshop Steering Committee 
for their assistance in the organization of this event. 
The objective of the Cross-Language Evaluation Forum (CLEF) is to develop and
maintain an infrastructure for the testing and evaluation of information retrieval sys- 
tems operating on European languages, in both monolingual and cross-language con- 
texts, and to create test-suites of reusable data that can be employed by system devel- 
opers for benchmarking purposes. The first CLEF evaluation campaign started in early 
2000 and ended with a workshop in Lisbon, Portugal, 21-22 September 2000.

This volume constitutes the proceedings of the workshop and also provides a rec- 
ord of the results of the campaign. It consists of two parts and an appendix. The first 
part reflects the presentations and discussions on the topic of evaluation for cross- 
language information retrieval systems during the first day of the workshop, whereas 
the second contains papers from the individual participating groups reporting their 
experiments and analysing their results. The appendix presents the evaluation tech- 
niques and measures used to derive the results and provides the run statistics. The aim 
of this Introduction is to present the main issues discussed at the workshop and also to 
provide the reader with the necessary background to the experiments through a de- 
scription of the tasks set for CLEF 2000. In conclusion, our plans for future CLEF 
campaigns are outlined. 



1 Evaluation for CLIR Systems 

The first two papers in Part I of the proceedings describe the organization of cross- 
language evaluation campaigns for text retrieval systems. CLEF is a continuation and 
expansion of the cross-language system evaluation activity for European languages 
begun in 1997 with the track for Cross-Language Information Retrieval (CLIR) in the 
Text REtrieval Conference (TREC) series. The paper by Harman et al. gives details 
on how the activity was organized, the various issues that had to be addressed, and the 
results obtained. The difficulties experienced during the first year, in which the track 
was coordinated centrally at NIST (US National Institute of Standards and Technology),
led to the setting up of distributed coordination in four countries (USA, Germany,
Italy and Switzerland), with native speakers being responsible for the preparation
of topics (structured statements of possible information needs) and relevance
judgments (assessment of the relevance of the ranked lists of results submitted by par- 
ticipating systems). A natural consequence of this distributed coordination was the 
decision, in 1999, to transfer the activity to Europe and set it up independently as
CLEF. The infrastructure and methodology adopted in CLEF are based on the experience
of the CLIR tracks at TREC.

The second paper by Kando presents the NTCIR Workshops, a series of evalua- 
tion workshops for text retrieval systems operating on Asian languages. The 2000- 
2001 campaign conducted by NTCIR included cross-language system evaluation for 
Japanese-English and Chinese-English. Although both CLEF and NTCIR have a 
common basis in TREC there are interesting differences between the methodology 
adopted by the two campaigns. In particular, NTCIR employs multigrade relevance 
judgments rather than the binary system used by CLEF and inherited from TREC. 
Kando motivates this decision and discusses the effects. 

The CLEF campaign provides participants with the possibility to test their sys- 
tems on both general-purpose texts (newspapers and newswires) and domain-specific 
collections. The third paper by Kluck and Gey examines the domain-specific task, 
begun in TREC and continued in CLEF, and describes the particular document collec- 
tion used: the GIRT database for social sciences. 

The rest of the papers in the first part of this volume focus on some of the main 
issues that were discussed during the first day of the workshop. These included the 
problem of resources, the transition from the evaluation of cross-language text re- 
trieval systems to systems running on other media, the need to consider the user per- 
spective rather than concentrating attention solely on system performance, and the 
importance of being able to evaluate single system components rather than focusing on 
overall performance. A further point for discussion was the addition of new languages 
to the multilingual document collection. 

The problem of resources has always been seen as crucial in cross-language system 
development. In order to be able to match queries against documents, some kind of 
lexical resource is needed to provide the transfer mechanism, e.g. bilingual or multi- 
lingual dictionaries, thesauri, or corpora. In order to be able to process a number of 
different languages, suitable language processing tools are needed, e.g. language-specific
tokenizers, stemmers, morphologies, etc. It is generally held that the quality
of the resource used considerably affects system performance. This question was dis- 
cussed at length during the workshop. The paper by Gonzalo presents a survey on the 
different language resources used by the CLEF 2000 participants. Many of the re- 
sources listed were developed by the participants themselves, thus showing that an 
evaluation exercise of this type is not only evaluating systems but also the resources 
used by the systems. The need for more pooling and sharing of resources between 
groups in order to optimize effort emerges clearly from this survey. Gonzalo concludes 
with some interesting proposals for the introduction of additional tasks, aimed at 
measuring the effect of the resources used on overall system performance, in a future 
campaign. 

The papers by Oard and by Jones both discuss CLIR from the user perspective. 
Oard focuses on the document selection question: how the users of a CLIR system can 
correctly identify the documents most useful to them from a ranked list of results
when they cannot read the language of the target collection. He advocates the advan- 
tages of an interactive CLIR evaluation and makes a proposal as to how an evaluation 
of this type could be included in CLEF. Jones also supports the extension of evaluation 
exercises in order to assess the usefulness of techniques that can assist the user with 
relevance judgment and information extraction. In this respect, he mentions the importance
of document summarization - already included in the NTCIR evaluation
programme. In addition, Jones talks about work in cross-language multimedia infor- 
mation retrieval and suggests directions for future research. He asserts that specifi- 
cally-developed standard test collections are needed to advance research in this area. 

In the final paper in Part I, Gey lists several areas in which research could lead to 
improvement in cross-language information retrieval including resource enrichment, 
the use of pivot languages and phonetic transliteration. In particular, he discusses the 
need for post-evaluation failure analysis and shows how this could provide important 
feedback resulting in improved system design and performance. CLEF provides the 
research community with the necessary infrastructure for studies of this type. 



2 The CLEF 2000 Experiments 

There were several reasons behind the decision to coordinate the cross-language sys- 
tem evaluation activity for European languages independently and to move it to 
Europe. One was the desire to extend the number of languages covered, another was 
the intention to offer a wider range of retrieval tasks to better meet the needs of the 
multilingual information retrieval research community. 

As can be seen from the descriptions of the experiments in Part II of this volume, 
CLEF 2000 included four separate evaluation tracks: 

• multilingual information retrieval 

• bilingual information retrieval 

• monolingual (non-English) information retrieval 

• cross-language domain-specific information retrieval 

The main task - inherited from TREC - required searching a multilingual document 
collection, consisting of national newspapers in four languages (English, French, Ger- 
man and Italian) of the same time period, in order to retrieve relevant documents. 
Forty topics were developed on the basis of the contents of the multilingual collection 
- ten topics for each collection - and complete topic sets were produced in all four 
languages. Topics are structured statements of hypothetical user needs. Each topic 
consisted of three fields: a brief title statement; a one-sentence description; a more 
complex narrative specifying the relevance assessment criteria. Queries are constructed
using one or more of these fields. Additional topic sets were then created for
Dutch, Finnish, Spanish and Swedish, in each case translating from the original. The 
main requirement was that, for each language, the topic set should be as linguistically 
representative as possible, i.e. using the terms that would naturally be expected to rep- 
resent the set of topic concepts in the given language. The methodology followed was 
that described in the paper by Harman et al.

A bilingual system evaluation task was also offered, consisting of querying the 
English newspaper collection using any topic language (except English). Many new- 
comers to cross-language system evaluation prefer to begin with the simpler bilingual 
task before moving on to tackle the additional issues involved in truly multilingual 
retrieval. 
One of the aims of the CLEF activity is to encourage the development of tools to
manipulate and process languages other than English. Different languages present 
different problems. Methods that may be efficient for certain language typologies may 
not be so effective for others. Issues that have to be catered for include word order, 
morphology, diacritic characters, language variants. For this reason, CLEF 2000 in- 
cluded a track for French, German and Italian monolingual information retrieval. 

The cross-language domain-specific task has been offered since TREC-7. The ra- 
tionale of this subtask is to test retrieval on another type of document collection, serv- 
ing a different kind of information need. The implications are discussed in the paper 
by Kluck and Gey in the first part of this volume. 

The papers in Part II describe the various experiments by the participating groups 
with these four tasks. Both traditional and innovative approaches to CLIR were explored,
and different query expansion techniques were tried. All kinds of source
to target transfer mechanisms were employed, including both query and document 
translation. Commercial and in-house resources were used and included machine 
translation, dictionary and corpus-based methods. The strategies used varied from 
traditional IR to a considerable employment of natural language processing tech- 
niques. Different groups focused on different aspects of the overall problem, ranging 
from the development of language-independent tools such as stemmers to much work 
on language-specific features like morphology and compounding. Many groups com- 
pared different techniques in different runs in order to evaluate the effect of a given 
technique on performance. Overall, CLEF 2000 offered a very good picture of current 
issues and approaches in CLIR. 

The first paper in this part by Martin Braschler provides an overview and analysis 
of all the results, listing the most relevant achievements and comparing them with 
those of previous years in the CLIR track at TREC. As one of the main objectives of 
CLEF is to produce evaluation test-suites that can be used by the CLIR research com- 
munity, Braschler also provides an analysis of the test collection resulting from the 
CLEF 2000 campaign, demonstrating its validity for future system testing, tuning and 
development activities. The appendix presents the evaluation results for each group, 
run by run. 



3 CLEF in the Future 

The CLEF 2001 campaign is now under way. The main tasks are similar to those of 
the first campaign. There are, however, some extensions and additions. In particular 
the multilingual corpus has been considerably enlarged and Spanish (news agency) and 
Dutch (national newspaper) collections for 1994 have been added. The multilingual 
task in CLEF 2001 involves querying collections in five languages (English, French, 
German, Italian and Spanish) and there will be two bilingual tracks: searching either 
the English or the Dutch collections. Spanish and Dutch have also been included in the 
monolingual track. There will be seven official topic languages, including Japanese. 
Additional topics will be provided in a number of other European languages, including 
Finnish, Swedish and Russian, and also in Chinese and Thai. 
CLEF 2000 concentrated on the traditional metrics of recall and precision.
However, these have limitations in what they tell us about the usefulness of a retrieval
system to the user. CLEF 2001 will thus also include an experimental track designed
to test interactive CLIR systems and to establish baselines against which future research
progress can be measured. The introduction of this track is a direct result of
discussions which began in the workshop with the presentations by Oard and by Jones,
and of the proposal by Oard reported in Part I of this volume.

Two main issues must be considered when planning future CLEF campaigns: the 
addition of more languages, and the inclusion of new tasks. 

The extension of language coverage, discussed considerably at the workshop, de- 
pends on two factors: the demand from potential participants and the existence of suf- 
ficient resources to handle the requirements of new language collections. It was de- 
cided that Spanish and Dutch met these criteria for CLEF 2001. CLEF 2002 and 2003 
will be mainly funded by a contract from the European Commission (IST-2000-31002)
but it is probable that, in the future, it will be necessary to seek support from national 
funding agencies as well if more languages are to be included. The aim will be to 
cover not only the major European languages but also some representative samples of 
minority languages, including members from each major group: e.g. Germanic, Ro- 
mance, Slavic, and Ugro-Finnic languages. Furthermore, building on the experience of 
CLEF 2001, we intend to continue to provide topics in Asian languages. 

CLEF 2000 concentrated on cross-language text retrieval and on measuring over- 
all system performance. However, in the future, we hope to include tracks to evaluate 
CLIR systems working on media other than text. We are now beginning to examine the
feasibility of organizing a spoken CLIR track in which systems would have to process 
and match spoken queries in more than one language against a spoken document col- 
lection. Another important innovation would be to devise methods that enable the as- 
sessment of single system components, as suggested in the paper by Gonzalo. 

CLIR system development is still very much in the experimental stage and involves
expertise from both the natural language processing and the information retrieval 
fields. The CLEF 2000 Workshop provided an ideal opportunity for a number of key 
players, with very different backgrounds, to come together and exchange ideas and 
compare results on the basis of a common experience: participation in the CLEF 
evaluation campaign. CLEF is very much a collaborative effort between organizers 
and participants with a common goal: the improvement of CLIR system performance.
The discussions at the workshop have had considerable impact on the organization
of the 2001 campaign. The success of future campaigns will depend on the
continuation and strengthening of this collaboration. 

More information on the organization of the current CLEF campaign and instruc- 
tions on how to contact us can be found at: http://www.clef-campaign.org/. 
Abstract. Starting in 1997, the National Institute of Standards and
Technology conducted 3 years of evaluation of cross-language information 
retrieval systems in the Text REtrieval Conference (TREC). Twenty- 
two participating systems used topics (test questions) in one language to 
retrieve documents written in English, French, German, and Italian. A 
large-scale multilingual test collection has been built and a new technique 
for building such a collection in a distributed manner was devised. 



1 Introduction 

The increasing globalization of information has led to a heightened interest in
retrieving information in languages that users are unable to search effectively.
Often these users can adequately read retrieved documents in non-native lan- 
guages, or can use existing gisting systems to get a good idea of the relevance of 
the returned documents, but are not able to create appropriate search questions. 
Ideally they would like to search in their native language, but have the ability 
to retrieve documents in a cross-language mode. 

The desire to build better cross-language retrieval systems resulted in a work- 
shop on this subject at the Nineteenth Annual International ACM-SIGIR Con- 
ference on Research and Development in Information Retrieval in 1996. Whereas 
many of the participants at this conference were concerned with the lack of suf- 
ficient parallel text to form a basis for research, one of the papers presented at 
that workshop provided the hope of avoiding the use of parallel corpora by the 
use of comparable corpora. 

This paper, by Paraic Sheridan, Jean Paul Ballerini and Peter Schäuble of
the Swiss Federal Institute of Technology (ETH), used stories from the Swiss
news agency Schweizerische Depeschen Agentur (SDA) that were taken from the
same time period. These newswire stories are not translations but are produced 
independently in each language (French, German and Italian) in the various 
parts of Switzerland. Whereas the stories do not overlap perfectly, there is in 
fact a high overlap of stories (e.g. international events) which are of interest in 
all parts of Switzerland. The paper detailed the use of this collection of stories to 
produce a test collection that enabled the evaluation of a series of cross-language 
retrieval experiments.

In 1997 it was decided to include cross-language information retrieval (CLIR) 
system evaluation as one of the tracks at the Sixth Text REtrieval Conference 
(TREC) held at the National Institute of Standards and Technology (NIST) 
(http://trec.nist.gov). The aim was to provide researchers with an infrastructure
for evaluation that would enable them to test their systems and compare
the results achieved using different cross-language strategies. This track
was done in cooperation with the Swiss Federal Institute of Technology, who 
not only obtained permission for TREC to use the SDA data, but also provided 
considerable guidance and leadership to the track. 

The main goals of the CLIR track in TREC were: 

1. to create the infrastructure for testing cross-language information retrieval 
technology through the creation of a large-scale multilingual test collection 
and a common evaluation setting; 

2. to investigate effective evaluation procedures in a multilingual context; and 

3. to provide a forum for the exchange of research ideas. 

There were CLIR tracks for European languages in TREC-6, TREC-7, 
and TREC-8. The TREC proceedings for each year (available on-line at 
http://trec.nist.gov) contain overviews of the track, plus papers from all
groups participating in the CLIR track that year. The rest of this paper sum- 
marizes the CLIR work done in those three years, with those summaries derived 
from the various track overviews. To conserve space, the numerous
individual papers are not included in the references but can be found in the 
section for the cross-language track in the appropriate TREC proceedings. A 
table listing all participants for a given TREC is given in each result section 
to facilitate the location of the individual papers. Note that there are additional
publications from these groups including further results and analyses, and the 
references in the track overviews should be checked to obtain these. 

2 TREC-6 CLIR Track Task Description 

The TREC-6 Cross-Language Information Retrieval (CLIR) track required the 
retrieval of either English, German or French documents that are relevant to top- 
ics written in a different language. Participating groups could choose any cross- 
language combination, for example English topics against German documents 
or French topics against English documents. In order to have a baseline retrieval
performance measurement for each group, the results of a monolingual retrieval
experimental run in the document language were also to be submitted. For instance,
if a cross-language experiment was run with English topics retrieving
German documents, then the result of an equivalent experiment where German 
topics retrieve German documents must also have been submitted. These re- 
sults would be considered comparable since the topics are assumed to be proper 
translations across the languages. 

The different document collections used for each language are outlined in 
Table 1. The Associated Press collection consists of newswire stories in English,
while the French SDA collection is a similar collection of newswire stories from 
the Swiss news agency (Schweizerische Depeschen Agentur). The German doc- 
ument collection has two parts. The first part is composed of further newswire 
stories from the Swiss SDA while the second part consists of newspaper articles 
from a Swiss newspaper, the ‘Neue Zuercher Zeitung’ (NZZ). The Italian data 
is included in this table for completeness although it was not used in TREC-6.

The newswire collections in English, French and German were chosen to 
overlap in timeframe (1988 to 1990) for two reasons. First, since a single set 
of topics had to be formulated to cover all three document languages, having 
the same timeframe for newswire stories increased the likelihood of finding a 
greater number of relevant documents in all languages. The second reason for 
the overlapping timeframe was to allow groups who use corpus-based approaches 
for cross-language retrieval to investigate what useful corpus information they 
could extract from the document collections being used. One of the resources 
provided to CLIR track participants was a list of 83,698 news documents in the
French and German SDA collections which were likely to be comparable based 
on an alignment of stories using news descriptors assigned manually by the SDA 
reporters, the dates of the stories, and common cognates in the texts of the 
stories. 



Document Collections

Doc. Language   Source                 No. Documents   Size
English         AP news, 1988-1990     242,918         760MB
German          SDA news, 1988-1990    185,099         330MB
                NZZ articles, 1994     66,741          200MB
French          SDA news, 1988-1990    141,656         250MB
Italian         SDA news, 1989-1990    62,359          90MB

Table 1. Document Collections used in the CLIR track.



The 25 test topic descriptions were provided by NIST in English, French 
and German, using translations of topics originally written mostly in English 
(see Figure 1 for an example topic, including all its translations). Participating
groups who wished to test other topic languages were permitted to create trans- 
lations of the topics in their own language and use these in their tests, as long 
as the translated topics were made publicly available to the rest of the track 
participants. The final topic set therefore also had translations of the 25 topics
in Spanish, provided by the University of Massachusetts, and Dutch, provided 
by TNO in the Netherlands. 



<num> Number : CL9 
<E-title> Effects of logging 

<E-desc> Description: 

What effects has logging had on desertification? 

<E-narr> Narrative: 

Documents with specific mention of local government’s or international 
agencies’ efforts to stop deforestation are relevant. Also relevant 
are documents containing information on desertification and its 
side effects such as climate change, soil depletion, flooding, and 
hurricanes caused by excessive logging. 

<num> Number : CL9 

<F-title> Les effets de la deforestation 
<F-desc> Description: 

Quels sont les effets de la deforestation sur la desertification? 
<F-narr> Narrative: 

Tous les documents qui donnent des analyses specifiques sur les mesures
des gouverments locaux ou des agences internationales pour frener
la deforestation sont pertinants. Les articles qui contiennent des
renseignements sur la desertification et ses effets secondaires comme
les changements de climat, l’epuisement de la terre, les inondations et
les ouragans sont egalement applicables.

<num> Number : CL9 

<G-title> Auswirkungen von Abholzung 
<G-desc> Description: 

Welche Auswirkungen hat das Abholzen auf die Ausbreitung der Wüste?
<G-narr> Narrative:

Alle Artikel über Bemühungen von Regierungen ebenso wie von
internationalen Agenturen die Wüstenausbreitung zu bremsen, sind
wesentlich. Ebenso relevant sind Artikel über Ausbreitung der Wüsten
und ihre Mitwirkungen, wie zum Beispiel Klimawechsel, Verarmung der
Erde und Orkane die auf übermässige Abholzung zurückzuführen sind.

Fig. 1. Sample CLIR topic statement from TREC-6, showing all languages. 
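
The topic format shown in Fig. 1 can be read mechanically. The following is a minimal sketch in Python, assuming that every field starts with a tag of the form <num>, <E-title>, <F-desc>, etc., as in the sample. The names parse_topic and build_query are hypothetical and are not part of any TREC software. The second function illustrates how a query might be formed from one or more topic fields.

```python
import re
from collections import defaultdict

# Shortened copy of the sample topic in Fig. 1 (English and French only).
SAMPLE_TOPIC = """<num> Number : CL9
<E-title> Effects of logging
<E-desc> Description:
What effects has logging had on desertification?
<num> Number : CL9
<F-title> Les effets de la deforestation
<F-desc> Description:
Quels sont les effets de la deforestation sur la desertification?
"""

# A tag is <num> or <X-field>, where X is a one-letter language code.
TAG = re.compile(r"<(?:(?P<lang>[A-Z])-)?(?P<field>num|title|desc|narr)>")
# Fixed labels that follow some tags and carry no topic content.
LABEL = re.compile(r"^(?:Number|Description|Narrative)\s*:\s*")


def parse_topic(text):
    """Return (topic_id, {language: {field: text}}) for one topic."""
    topic_id = None
    fields = defaultdict(dict)
    matches = list(TAG.finditer(text))
    for i, match in enumerate(matches):
        # A field runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = LABEL.sub("", text[match.end():end].strip())
        body = " ".join(body.split())  # collapse line breaks and extra spaces
        if match.group("field") == "num":
            topic_id = body
        else:
            fields[match.group("lang")][match.group("field")] = body
    return topic_id, dict(fields)


def build_query(fields, lang, use=("title", "desc")):
    """Concatenate the chosen fields of one language into a query string."""
    chosen = fields.get(lang, {})
    return " ".join(chosen[name] for name in use if name in chosen)


topic_id, fields = parse_topic(SAMPLE_TOPIC)
print(topic_id, "|", build_query(fields, "E"))
```

Choosing a different tuple for the use parameter (for example title only, or title, description and narrative) corresponds to the different topic lengths that participating groups were free to use.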
Although not strictly within the definition of the cross-language task, participation
by groups who wanted to run monolingual retrieval experiments in either
French or German using the CLIR data was also permitted. Since the CLIR track 
was run for the first time in TREC-6, this was intended to encourage new IR 
groups working with either German or French to participate. The participation 
of these groups also helped to ensure that there would be a sufficient number of 
different system submissions to provide the pool of results needed for relevance 
judgements. 
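
The pooling of results mentioned above can be sketched as follows. This is a simplified illustration in Python, assuming that each submitted run is a ranked list of document identifiers per topic and that the top-ranked documents of every run are merged into one set per topic for assessment. The function name build_pool, the pool depth and the data layout are assumptions made for the example, not values reported for the track.

```python
def build_pool(runs, depth=100):
    """Merge the top `depth` documents of every run into one pool per topic.

    runs: {run_name: {topic_id: [doc_id, ...]}} with documents ranked best-first.
    Returns {topic_id: set of doc_ids to be judged}.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pool


# Two hypothetical runs for one topic; the depth is kept small for illustration.
runs = {
    "run_a": {"CL9": ["d1", "d2", "d3"]},
    "run_b": {"CL9": ["d2", "d4", "d5"]},
}
pool = build_pool(runs, depth=2)
print(sorted(pool["CL9"]))
```

The more the submitted systems differ from one another, the more varied the pooled documents are likely to be, which is why a sufficient number of different submissions matters.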

The evaluation of CLIR track results was based on the standard TREC eval- 
uation measures used in the ad hoc task. Participating groups were free to use 
different topic fields (lengths) and to submit either automatic or manual exper- 
iments according to the definitions used for the main TREC ad hoc task. 

3 TREC-6 Results 

A total of thirteen groups, representing six different countries, participated in the 
TREC-6 CLIR track (Table 2). Participating groups were encouraged to run as
many experiments as possible, both with different kinds of approaches to CLIR 
and with different language combinations. An overview of the submitted runs is 
given in Table 3 and shows that the main topic languages were used equally,
each used in 29 experiments, whereas English was somewhat more popular than 
German or French as the choice for the document language to be retrieved. This 
is in part because the groups who used the query translations in Spanish and 
Dutch only evaluated those queries against English documents. A total of 95 
result sets were submitted for evaluation in the CLIR track. 



TREC-6 Participants

Participant                                                  Country
CEA/Saclay (no online paper)                                 France
Cornell/SabIR Research Inc.                                  USA
Dublin City University                                       Ireland
Duke University/University of Colorado/Microsoft Research    USA
IRIT/SIG                                                     France
New Mexico State University                                  USA
Swiss Federal Institute of Technology (ETH)                  Switzerland
TwentyOne (TNO/U-Twente/DFKI/Xerox/U-Tuebingen)              Netherlands
University of California, Berkeley                           USA
University of Maryland, College Park                         USA
University of Massachusetts, Amherst                         USA
University of Montreal/Laboratoire CLIPS, IMAG               Canada
Xerox Research Centre Europe                                 France

Table 2. Organizations participating in the TREC-6 CLIR track
Language Combinations

                 Query Language
Doc. Language    English   German   French   Spanish   Dutch   Total
English          7         15       10       2         6       40
German           12        10       4        -         -       26
French           10        4        15       -         -       29
Total            29        29       29       2         6       95

Table 3. Overview of submissions to CLIR track.



An important contribution to the track was made by a collaboration between 
the University of Maryland and the LOGOS corporation, who provided a ma- 
chine translation of German documents into English. Only the German SDA 
documents were prepared and translated in time for the submission deadline. 
This MT output was provided to all participants as a resource, and was used 
to support experiments run at ETH, Duke University, Cornell University, the
University of California at Berkeley, and the University of Maryland.

Cross-language retrieval using dictionary resources was the approach taken
in experiments submitted by groups at New Mexico State University, University
of Massachusetts, the Commissariat à l’Energie Atomique of France, the Xerox
Research Centre Europe, and TNO in the Netherlands. Machine readable dictionaries
were obtained from various sources, including the Internet, for different
combinations of languages, and used in different ways by the various groups. 

The corpus-based approach to CLIR was evaluated by ETH, using similarity
thesauri, and the collaborative group of Duke University, the University of
Colorado, and Bellcore, who used latent semantic indexing (LSI). An innovative
approach for cross-language retrieval between English and French was tested
at Cornell University. This approach was based on the assumption that there
are many similar-looking words (near cognates) between English and French 
and that, with some simple matching rules, relevant documents could be found 
without a full translation of queries or documents. 
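
The general idea of matching near cognates can be illustrated with a short sketch. The rules used at Cornell are not reproduced here. As a stand-in, the sketch uses a generic string-similarity test from the Python standard library (difflib.SequenceMatcher); the function names and the threshold of 0.8 are arbitrary assumptions made for the example.

```python
from difflib import SequenceMatcher


def is_near_cognate(word_a, word_b, threshold=0.8):
    """Treat two words as near cognates if their character similarity is high.

    The similarity is SequenceMatcher.ratio(), a value between 0 and 1.
    The threshold is an assumption, not a value taken from the track.
    """
    ratio = SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()
    return ratio >= threshold


def cognate_matches(query_terms, document_terms, threshold=0.8):
    """Map each query term to the document terms that look like cognates."""
    return {
        query: [doc for doc in document_terms if is_near_cognate(query, doc, threshold)]
        for query in query_terms
    }


# English query terms matched against French document terms.
print(cognate_matches(["climate", "logging"], ["climat", "terre"]))
```

In this example "climate" is matched to "climat", whereas "logging" has no similar-looking French counterpart in the list, which shows both the appeal and the limits of an approach that relies on surface similarity alone.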

An overview of results for each participating group is presented in Figure 2.
This figure represents the results based on only 21 of the 25 test topics, but the
results from all 25 are not significantly different. The figure shows results for each
group and each document language for which experiments were submitted. The y
axis represents the average precision achieved for the best experiment submitted
by each group and each document language. Cross-language experiments are
denoted by, for example, ‘X to French’, whereas the corresponding monolingual
experiments are denoted ‘French’. For example, the figure shows that the best
experiment submitted by Cornell University performing cross-language retrieval
of French documents achieved average precision of 0.2.
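
For readers unfamiliar with the measure, the sketch below computes average precision for one ranked list and its mean over topics, assuming the usual non-interpolated definition. The function names and data layout are illustrative only; the figures reported in this paper were not produced by this code.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Non-interpolated average precision for a single topic."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """runs: topic -> ranked doc ids; judgments: topic -> relevant doc ids."""
    topics = list(judgments)
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), judgments[t]) for t in topics)
    return total / len(topics)
```

A list that places the two relevant documents at ranks 1 and 3 scores (1/1 + 2/3) / 2, roughly 0.83.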

Note that the presentation of results in Figure 2 does not distinguish between
fully automatic cross-language retrieval and those groups who included
some interactive aspect and user involvement in their experiments. The groups
at Xerox, Berkeley and Dublin City University submitted experiments which
involved manual interaction. Also, some groups participated only in a monolingual
capacity: Dublin City University, University of Montreal, and IRIT France.



[Figure: average precision of the best run per group and document language. Legend: English, X to English, French, X to French, German, X to German, Spanish to English. Groups on the x axis include LSI, Cornell, Xerox, CEA, NMSU, DCU and UMd.]

Fig. 2. CLIR Track Results (Average Precision, best run)

Although Figure 2 does not provide a sound basis for between-group comparisons,
some general comments can be made on the overall results. Comparing
cross-language results to the corresponding monolingual experiments, it seems
that cross-language retrieval is performing in a range of roughly 50% to 75% of
the equivalent monolingual case. This is consistent with previous evaluations of
cross-language retrieval. Many different approaches to cross-language retrieval
were tried and evaluated, and groups using each of the different approaches have
achieved good results. For example, the corpus-based method used by ETH to
perform cross-language retrieval for German documents worked as well as the
machine-translation-based methods used by the University of Maryland and
Cornell. The dictionary-based method used by Xerox for cross-language retrieval to
French did about the same as the use of cognate overlap by Cornell.

4 TREC-6 Evaluation Issues 

In general the testing paradigm and test collection used in the TREC-6 CLIR 
track worked well, but there were two issues that caused concern. First, the 
many possible language pairs used by the various participants made it difficult 
to compare across systems, and presented a somewhat unrealistic evaluation in 
that many situations require retrieval of documents regardless of the language
of those documents. This would suggest that an improved task would be the 
retrieval of a ranked list of documents in all three languages, i.e. a merged list, 
and this task was implemented in TREC-7. 

The second issue was more difficult to solve. The TREC-6 topics were created 
at NIST by two persons who were native English speakers but who had strong 
skills in French and German. Because these people were new to TREC and
NIST staff were unable to provide much guidance, lacking the necessary knowledge
and skills, the TREC-6 CLIR topics are more simplistic than the TREC topics
normally created in English, and this may have allowed the simpler CLIR techniques
to work better than would be expected. Additionally there were some problems with the
translations produced for the topics at NIST, and corrections needed to be made 
by native speakers before the topics could be released. As a final problem, NIST 
assessors working in non-native languages tend to be much slower in making 
relevance judgments, and this became considerably worse when working in three 
languages. Only 13 out of 25 topics were evaluated in time for any analysis before 
TREC, with the rest not finished until several months later. This problem with 
non-native speakers led to forming collaborative partnerships for the evaluation 
effort in TREC-7. 

5 TREC-7 CLIR Track Task Description 

In TREC-7, the task was changed slightly and participants were asked to retrieve
documents from a multilingual pool. They were able to choose the topic
language, and then had to find relevant documents in the pool regardless of the
languages the texts were formulated in. As a side effect, this meant that most
groups had to solve the additional task of merging results from various bilingual
runs. The languages present in the pool were English, German, French and
Italian, with Italian being a new language introduced for TREC-7. There were 
28 topics distributed, each topic being translated into four languages. To allow 
for participation of groups that did not have the resources to work in all four 
languages, a secondary evaluation was provided that permitted such groups to 
send in runs using English topics to retrieve documents from a subset of the pool 
just containing texts in English and French. There were no monolingual runs as 
part of the cross-language track in TREC-7. 

The TREC-7 task description also defined a subtask (GIRT), working with a 
second data collection containing documents from a structured data base in the 
field of social science. Unfortunately, the introduction of this data was probably 
premature, since no groups were able to work with this data in TREC-7. The data 
was used again in TREC-8 (see task description in TREC-8 for more information 
on this data). 

The document collection for the main task contained the same documents 
used in TREC-6, with an extension to Italian texts from SDA (see Table [Q. 
Note that Italian texts were only available for 1989 and 1990, and therefore the 
Italian SDA collection is considerably smaller than the SDA for French or the 
English AP texts. 

There were significant changes in the way the topics were created for TREC-7 
because of the problems in TREC-6. Four different sites, each located in an 
area where one of the topic languages is natively spoken, worked on both topic 
creation and relevance judgments. 

The four sites were: 

— English: NIST, Gaithersburg, MD, USA (Ellen Voorhees) 

— French: EPFL Lausanne, Switzerland (Afzal Ballim) 

— German: IZ Sozialwissenschaften, Germany (Jurgen Krause, Michael Kluck) 

— Italian: CNR, Pisa, Italy (Carol Peters). 

Seven topics were chosen from each site to be included in the topic set. The 
21 topics from the other sites were then translated, and this ultimately led to a 
collection of 28 topics, each available in all four languages. Relevance judgments 
were made at all four sites for all 28 topics, with each site examining only the 
pool of documents in their native language. 



6 TREC-7 Results 

A total of nine groups from five different countries submitted results for the
TREC-7 CLIR track (Table 4). The participants submitted 27 runs, 17 for
the main task, and 10 for the secondary English to French/English evaluation.
Five groups (Berkeley, Eurospider, IBM, Twenty-One and Maryland) tackled
the main task. English was, not surprisingly, the most popular topic language,
with German coming in a strong second. Every language was used by at least
one group.

TREC-7 Participants

Participant                                        Country
CEA (Commissariat à l’Energie Atomique)            France
Eurospider Information Technology AG               Switzerland
IBM T.J. Watson Research Center                    USA
Los Alamos National Laboratory                     USA
TextWise LLC                                       USA
Twenty-One (University of Twente/TNO-TPD)          Netherlands
University of California, Berkeley                 USA
University of Maryland, College Park               USA
University of Montreal/Laboratoire CLIPS, IMAG     Canada

Table 4. Organizations participating in the TREC-7 CLIR track 



Figure 3 shows a comparison of runs for the main task. Shown are the best
automatic runs against the full document pool for each of the five groups that
worked on the main task. As can be seen, most participants performed in a fairly
narrow band. This is interesting given the very different approaches of the
individual participants: IBM used translation models automatically trained on
parallel and comparable corpora; Twenty-One used sophisticated dictionary lookup
and a “boolean-flavoured” weighting scheme; Eurospider employed corpus-based
techniques, using similarity thesauri and pseudo-relevance feedback on aligned
documents; and the Berkeley and Maryland groups used off-the-shelf machine
translation systems.

A particularly interesting aspect of the TREC-7 CLIR track was how participants
approached the merging problem. Again, many interesting methods were
used. Among the solutions proposed: Twenty-One compared averages of
similarity values of individual runs, Eurospider used document alignments to
map runs to comparable score ranges through linear regression, and IBM
modeled system-wide probabilities of relevance. It was also possible to
avoid the merging problem altogether: the Berkeley group expanded the topics
to all languages and then ran them against an index containing documents from
all languages, thereby directly retrieving a multilingual result list.
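
One way to picture regression-based merging: if some documents in two collections are known to be aligned (for example, translations of the same news story), their scores in two runs give paired observations from which a linear map between the score scales can be fitted. The sketch below is a generic illustration under that assumption, with hypothetical function names and data layout; it is not Eurospider's implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("x values must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def merge_with_regression(run_a, run_b, alignments):
    """Map run_b scores onto run_a's scale, then merge by score.

    run_a, run_b: dict doc_id -> score for two result lists.
    alignments: (doc_id_in_a, doc_id_in_b) pairs assumed to be
    translations of each other (a hypothetical input format).
    """
    pairs = [(run_b[b], run_a[a]) for a, b in alignments
             if a in run_a and b in run_b]
    slope, intercept = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    merged = list(run_a.items())
    merged += [(doc, slope * score + intercept) for doc, score in run_b.items()]
    return sorted(merged, key=lambda item: item[1], reverse=True)
```

The quality of such a merge depends on how many aligned documents appear in both result lists; with only a handful of pairs the fitted line is unreliable.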



7 TREC-7 Evaluation Issues 

One of the distinguishing features of the TREC-7 CLIR track was that the topic 
development and relevance assessments were done in a distributed manner. Based 
on the experiences of TREC-6, this was a critical necessity, but it is important 
to understand the possible impact of this decision on the results. 

Topic development is clearly subjective, and depends on the creator’s own 
particular background. Additionally for CLIR it must be presumed that both 
the language and cultural background also impact the choice and phrasing of 
topics. A close examination of the topics in TREC-7 would probably permit an
astute observer to group them fairly accurately according to source language
and creation site. This should not be considered negative nor should it affect the
validity of the results. However, it causes some problems both in the translation
of the topics and in their assessment.

Fig. 3. Results of the main TREC-7 CLIR evaluation, X to EGFI

Topic translation raises the typical problems involved in any translation: 
a total understanding of the source is necessary in order to achieve a perfect 
rendering of the target. But this is complicated in CLIR by the need to find an 
acceptable balance between precision with respect to the source and naturalness 
with respect to the target language. Ideally the translations should reflect how a 
native speaker would phrase a search for that topic in their language and culture.

Accurate assessment of relevance for retrieved documents for a given topic 
implies a good understanding of the topic. The fact that the CLIR track used a 
distributed scenario for building topics and making relevance judgments meant 
that relevance judgments were usually not done by the creators of the topics. In 
addition to general problems of judgment consistency when this occurs, there is 
also the influence of the multilingual/multicultural characteristics of the task. 
The way a particular topic is discussed in one language will not necessarily 
be reproduced in the documents in other languages. Therefore a topic which 
did not appear to raise problems of interpretation in the language used for its
preparation may be much more difficult to assess against documents in another
language. 

There were no problems reported by the participants with either the topic 
creation, the translations, or the relevance judgments. Nevertheless, it was de- 
cided to work on closer coordination between the four groups in TREC-8, and 
to get a fifth group that specializes in translations to check all final topic trans- 
lations for both accuracy and naturalness. The effect of the distributed method 
of relevance judgments on results is probably small since the distribution was 
across languages, not topics. As long as results are compared within the same 
language, i.e. pairs of results on German documents, and not across languages, 
i.e. results on English documents vs German documents, there are unlikely to be 
issues here. Comparing results from retrieving documents in different languages
is equivalent to comparison of results using two different human judges, and 
therefore this comparison should be avoided. 

8 TREC-8 Task Description 

The CLIR task in TREC-8 was similar to that in TREC-7. The document collection
was the same, and 28 new topics were provided in all four languages. In
order to attract newcomers, monolingual non-English runs were accepted; however,
participants preferred to do bilingual cross-language runs when they could
not do the full task.

The TREC-8 task description also included the vertical domain subtask,
containing documents from a structured database in the field of social science
(the “GIRT” collection). This collection comes with English titles for most
documents, and a matching bilingual thesaurus. The University of California at
Berkeley conducted some very extensive experiments with this collection.

The topic creation and relevance assessment sites for TREC-8 were:

— English: NIST, Gaithersburg, MD, USA (Ellen Voorhees) 

— French: University of Zurich, Switzerland (Michael Hess) 

— German: IZ Sozialwissenschaften, Germany (Jurgen Krause, Michael Kluck) 

— Italian: CNR, Pisa, Italy (Carol Peters).

At each site, an initial 10 topics were formulated. At a topic selection meeting, 
the seven topics from each site that were felt to be best suited for the multilingual 
retrieval setting were selected. Each site then translated the 21 topics formulated 
by the others into the local language. This ultimately led to a pool of 28 topics, 
each available in all four languages. It was decided that roughly one third of 
the topics should address national/regional, European and international issues, 
respectively. To ensure that topics were not too broad or too narrow and were 
easily interpretable against all document collections, monolingual test searches 
were conducted. As a final check on the translations, Prof. Christa Womser-Hacker
from the University of Hildesheim volunteered her students to review all
topic translations.

9 TREC-8 Results

A total of twelve groups from six different countries submitted results for the
TREC-8 CLIR track (Table 5). Eight participants tackled the full task (up from
five in TREC-7), submitting 27 runs (up from 17). The remainder of the participants
either submitted runs using a subset of languages, or concentrated on
the GIRT subtask only. English was the dominant topic language, although each
language was used by at least one group as the topic language. 



TREC-8 Participants

Participant                                        Country
Claritech                                          USA
Eurospider Information Technology AG               Switzerland
IBM T.J. Watson Research Center                    USA
IRIT/SIG                                           France
Johns Hopkins University, APL                      USA
MNIS-TextWise Labs                                 USA
New Mexico State University                        USA
Sharp Laboratories of Europe Ltd                   England
Twenty-One (University of Twente/TNO-TPD)          Netherlands
University of California, Berkeley                 USA
University of Maryland, College Park               USA
University of Montreal/Laboratoire CLIPS, IMAG     Canada



Table 5. Organizations participating in the TREC-8 CLIR track 



Figure 4 shows a comparison of runs for the main task. The graph shows the
best runs against the full document pool for each of the eight groups. Because of
the diversity of the experiments conducted, the figures are best compared on the
basis of the specific features of the individual runs, details of which can be found
in the track papers. For example, New Mexico State runs use manually translated
queries, which are the result of a monolingual user interactively picking good
terms. This is clearly an experiment that is very different from the runs of some
other groups that are essentially doing “ad hoc” style cross-language retrieval,
using no manual intervention whatever.

Approaches employed in TREC-8 by individual groups include: 

— experiments on pseudo relevance feedback by Claritech 

— similarity thesaurus based translation by Eurospider 

— statistical machine translation by IBM 

— combinations of n-grams and words by Johns Hopkins University

— use of conceptual interlingua by MNIS-Textwise 

— query translation using bilingual dictionaries by Twenty-One 

— evaluation of the Pirkola measure by University of Maryland 






Donna Harman et al. 




[Figure: one curve per run, plotted against Recall. Runs shown: ibmdSfa, aplxll, tnoSdpx, CLARITrmwfi, Mer8Can2xO, EIT99mta, umd99bl, nmsuil.]

Fig. 4. Results of the main TREC-8 CLIR evaluation, X to EGFI



— translation models derived from parallel text by University of Montreal

— use of an online machine translation system by IRIT 

Merging remained an important issue for most participants. The University
of Maryland tried to circumvent the problem by using a unified index in some
of their runs, but the other groups working on the main task all had to rely on
merging of some sort to combine their individual, bilingual cross-language runs.
Some of the approaches this year include: merging based on probabilities that
were calculated using log(Rank) by various groups including IBM, merging using
linear regression on document alignments by Eurospider, linear combinations of
scores by Johns Hopkins, and of course, straight, score-based merging.
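
The simplest of these options, straight score-based merging, can be sketched as follows. The optional per-language weights are a hypothetical extension added for illustration (they must be positive so that each list stays sorted), and the sketch assumes raw scores from different runs are directly comparable, which is exactly the assumption the more elaborate methods try to avoid.

```python
import heapq
from itertools import islice


def merge_by_score(runs, weights=None, limit=1000):
    """Merge per-language result lists into one list ordered by score.

    runs: dict language -> list of (doc_id, score), each list already
    sorted by descending score.
    weights: optional dict language -> positive multiplier (hypothetical).
    """
    weights = weights or {}
    weighted = [
        [(score * weights.get(lang, 1.0), doc_id) for doc_id, score in results]
        for lang, results in runs.items()
    ]
    # heapq.merge lazily interleaves the already-sorted input lists.
    merged = heapq.merge(*weighted, key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in islice(merged, limit)]
```

If one run produces systematically higher scores than another, its documents will dominate the top of the merged list regardless of their relevance.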

Two groups submitted runs for the GIRT subtask. The University of California
at Berkeley participated exclusively in the subtask, and did some very
comprehensive experiments using both the English titles of the documents and
the English/German thesaurus supplied with the collection. These runs show
some of the interesting properties of GIRT. It is also possible to do ad hoc style
runs on GIRT, ignoring controlled vocabulary, English titles and the thesaurus.
This approach was taken by Eurospider.

10 TREC-8 Evaluation Issues and Summary

It was generally felt that the final check on the translation quality and the 
elimination of topics that were likely to have problems in interpretation across 
languages improved the process of distributed evaluation. Two issues remain 
however that warrant further discussion. These issues are not unique to TREC-8, 
although they appear to have grown worse over the three years of CLIR in TREC. 

First, there is the issue of the size of the pools in the various languages. 
Relevance judgments in TREC are done using a pooling method, i.e. the top-ranked
documents from each submitted run are put into a “pool”, removing
duplicates, and this pool is judged by humans. There have been studies done 
both on the completeness of these pools and on the consistency of relevance 
judgments across assessors 0. m), with the results showing that the pooling 
method produces stable results for the 50-topic TREC (English) ad hoc task. 
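
The pooling step described above can be sketched in a few lines. The pool depth of 100 is a hypothetical default chosen for illustration, since the depth used in the track is not stated here, and the data layout is likewise an assumption.

```python
def build_pool(runs, depth=100):
    """Collect the top-ranked documents of every run into per-topic pools.

    runs: dict run_id -> dict topic -> ranked list of doc ids.
    depth: number of top documents taken from each run (hypothetical value).
    Returns dict topic -> set of doc ids to be judged; using a set removes
    documents retrieved by more than one run.
    """
    pool = {}
    for rankings_by_topic in runs.values():
        for topic, ranked_docs in rankings_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool
```

Documents outside the pool are never judged and are treated as not relevant, which is why pool size and the diversity of contributing runs matter for the completeness of the judgments.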

But these conclusions are based on having enough topics to allow a stable 
average across assessors, enough documents in the pools to assure most relevant 
documents have been found, and enough participating groups to contribute dif- 
ferent sets of documents to the pool. Voorhees showed that the use of 25 topics 
is a minimum to ensure stability across assessors, and therefore the averages for
the CLIR results can be considered stable for comparison. 

The small size of the pools, particularly in German and Italian, may imply 
that the collections cannot be viewed as complete. For TREC-6, where mostly 
monolingual runs were judged, there was a per-topic average of 350 documents 
judged in English and in German (500 in French). But the merged runs judged 
for TRECs 7 and 8 produced far fewer documents for German and Italian in 
the pools (160 German/100 Italian judged for TREC-7; 146 German/155 Italian 
for TREC-8), and it is likely that additional relevant documents exist for these 
languages in the collection. This does not make the use of these collections 
invalid, but does require caution in their use when it is probable that many new 
Italian or German documents will be retrieved. For further analysis on this point 
see pi], and 

The second issue involves the problem of cross-language resources. Looking 
at TREC-8 for example, two main points stand out with respect to the main 
task: first, 21 out of 27 submitted runs used English as the topic language, and 
second, at least half of all groups used the Systran machine translation system 
in some form for parts of their experiments. While English was also the most 
popular choice for TREC-7, the percentage of runs that used non-English topics 
was substantially higher (7 out of 17). 

Part of the reason for the heavy use of English as the topic language is that 
75% of the TREC-8 participants are from English speaking countries. But an 
additional factor is the lack of resources that do not use English as the source 
language, e.g. dictionaries for German to Italian. One reason for the choice of 
Systran by so many groups also lies in a lack of resources: using Systran allowed 
the groups to do something with certain language pairs that they would otherwise 
not have been able to include in their experiments. Because Systran offers mainly
combinations of English with other languages, this influenced the domination of
English as topic language.

Both of these reasons contributed to the decision to move the European 
cross-language task to Europe in 2000 within the new CLEF evaluation. It was 
generally felt that more Europeans would join such an activity and that these 
groups would bring with them increased knowledge of non-English resources. 

The three years of European cross-language evaluation done at NIST not only 
achieved the initial goals, but laid the foundation for continued CLIR evaluation 
in Europe and now starting in Asia. The first large-scale test collection for cross- 
language retrieval was built and will continue to be distributed for test purposes. 
Twenty-two groups have taken part in the evaluations, cumulatively reporting 
over 100 experiments on diverse methods of cross-language retrieval. And finally,
a new technique has been devised to produce the necessary topics and relevance
judgments for the test collections in a distributed manner such that the collection
properly reflects its multilingual and multicultural origins.

Abstract. This paper introduces the NTCIR Workshops, a series of evaluation
workshops designed to enhance research in Japanese and Asian language text 
retrieval, cross-lingual information retrieval, and related text processing 
techniques such as summarization, extraction, etc. by providing large-scale test 
collections and a forum of researchers. Twenty-eight groups from six countries 
participated in the first workshop and forty-six groups from eight countries have 
registered for the second. The test collections used in the Workshops are 
basically TREC-type collections but they contain several unique characteristics 
including multi-grade relevance judgments. Finally some thoughts on future 
directions are suggested. 



1 Introduction 

The purposes of the NTCIR Workshop [1] are the following: 

1. to encourage research in information retrieval (IR), cross-lingual information 
retrieval (CLIR) and related text processing technology including term recognition, 
information extraction and summarization by providing large-scale reusable test 
collections and a common evaluation setting that allows cross-system comparisons 

2. to provide a forum for research groups interested in comparing results and 
exchanging ideas or opinions in an informal atmosphere 

3. to investigate methods for constructing test collections or data sets usable for 
experiments and methods for laboratory-type testing of IR and related technology 

For the first NTCIR Workshop, the process started with the distribution of the 
training data set on 1st November 1998, and ended with the workshop meeting which 
was held on 30th August - 1st September 1999 in Tokyo, Japan [2]. The participation 
in the workshop was limited to the active participants, i.e. the members of the 
research groups that submitted the results of the tasks. Many interesting papers with 
various approaches were presented and the meeting ended in enthusiasm. The third 
day of the Workshop was organised as the NTCIR/IREX Joint Workshop. The IREX 
Workshop [3], another evaluation workshop for IR and information extraction (named 
entities) using Japanese newspaper articles, was held consecutively. IREX and
NTCIR worked together to organise the second NTCIR Workshop. The research
group at National Taiwan University has proposed the Chinese IR Task and is
organising it at the NTCIR Workshop 2. The process of the NTCIR Workshop 2
started in June 2000 and will end with the meeting on 7-9 March 2001 [4].

From the beginning of the NTCIR project, we have focused on two directions of 
investigation, i.e., (1) traditional IR system testing and (2) challenging issues. For the 
former, we have placed emphasis on IR with Japanese or other Asian languages and 
CLIR. Indexing texts written in Japanese or other East Asian languages like Chinese 
is quite different from indexing texts in English, French or other European languages 
since there is no explicit boundary (i.e. no space) between words in a sentence. CLIR 
is critical in the Internet environment, especially between languages with completely 
different origins and structure like English and Japanese. Moreover in scientific texts 
or everyday-life documents like Web documents, foreign language terms often appear 
in Japanese texts both in their original spelling and in transliterated forms. To 
overcome the word mismatch that may be caused by such expression variance,
cross-linguistic strategies are needed even for monolingual retrieval of Japanese
documents of the type described in [5].
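
Because there are no spaces to split on, one widely used workaround is to index overlapping character n-grams instead of words. It is named here as an illustration only, not as a method prescribed by NTCIR. A minimal sketch, with hypothetical function names:

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(documents, n=2):
    """documents: dict doc_id -> text. Returns n-gram -> set of doc ids."""
    index = {}
    for doc_id, text in documents.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query, n=2):
    """Rank documents by how many distinct query n-grams they contain."""
    counts = {}
    for gram in set(char_ngrams(query, n)):
        for doc_id in index.get(gram, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Bigrams avoid the need for a word segmenter, at the cost of a larger index and some spurious matches across true word boundaries.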

For the challenging issues, we have been interested in (2a) document genres (or 
types), and (2b) intersection of natural language processing (NLP) and IR. Each 
document genre has its own user group and patterns of use, and the criteria
determining a "successful search" may vary accordingly, though traditional IR
research has looked at generalised systems which can handle any kind of document.
For example, Web document retrieval has different characteristics from those of
newspaper or patent retrieval, both with respect to the nature of the documents
themselves and the way they are used.
We have investigated appropriate evaluation methods for each genre. 

In IR with Asian Languages, NLP can play important roles such as identifying 
word boundaries and so on. Moreover, NLP techniques help to make the 
"information" in the retrieved documents more usable for users, for example, by 
pinpointing the answer passages in the retrieved documents, extracting information, 
summarization, supporting the comparison of multiple documents and so on. The 
importance of such technology to make retrieved information immediately exploitable 
by the user is increasing in the Internet environment in which novice end users have
to face a huge amount of heterogeneous information resources. Therefore both IREX
and NTCIR included both IR and NLP-related tasks from the beginning.

In the next section, we outline the Workshops. Section 3 describes the test 
collections used and Section 4 discusses some thoughts on future directions. 



2 Overview of the NTCIR Workshop 



This section introduces the tasks, procedures and evaluation results of the first
NTCIR. We then discuss the characteristic aspects of CLIR with scientific documents,
which was a task at the first NTCIR Workshop.

2.1 Tasks

Each participant has conducted one or more of the following tasks at each workshop. 

NTCIR Workshop 1

- Ad Hoc Information Retrieval Task: to investigate the retrieval performance of 
systems that search a static set of documents using new search topics. 

- Cross-Lingual Information Retrieval Task: an ad hoc task in which the documents 
are in English and the topics are in Japanese. 

- Automatic Term Recognition and Role Analysis Task: (1) to extract terms from 
titles and abstracts of documents, and (2) to identify the terms representing the 
"object", "method", and "main operation" of the main topic of each document. 

The test collection NTCIR-1 was used in these three tasks. In the Ad Hoc
Information Retrieval Task, the document collection containing Japanese, English and
Japanese-English paired documents is retrieved by Japanese search topics. In Japan,
document collections often naturally consist of such a mixture of Japanese and
English. Therefore the Ad Hoc IR Task at the NTCIR Workshop 1 is essentially
a CLIR task, though some of the participating groups discarded the English part
and did the task as Japanese monolingual IR.

NTCIR Workshop 2 

- Chinese IR Task: including English-Chinese CLIR (E-C) and Chinese monolingual 
IR (C-C) using the test collection CHIB01, consisting of newspaper articles from
five newspapers in Taiwan R.O.C. 

- Japanese-English IR Task: using the test collections NTCIR-1 and -2, including
monolingual retrieval of Japanese and English (J-J, E-E) and CLIR of Japanese and 
English (J-E, E-J, J-JE, E-JE). 

- Text Summarization Challenge: text summarization of Japanese newspaper articles 
of various kinds. The NTCIR-2 Summ collection and TAO Summ Collection are 
used. 

The new challenging task is called "Challenge". Each task or challenge has been 
proposed and organised by different research groups in a rather independent way 
while keeping good contacts and discussion with the NTCIR Project organising group 
headed by the author. How to evaluate and what should be evaluated as a new
"Challenge" has been thoroughly discussed through a discussion group.

2.2 Participants 

NTCIR Workshop 1. Below is the list of active participating groups that submitted 
task results. Thirty-one groups, including participants from six countries, enrolled to 
participate in the first NTCIR Workshop. Of these groups, twenty-eight groups 
enrolled in IR tasks (23 in the Ad Hoc Task and 16 in the Cross-Lingual Task), and 
nine in the Term Recognition task. Twenty-eight groups from six countries submitted 
results. Two groups worked without any Japanese language expertise. 

Communications Research Laboratory (MPT), Fuji Xerox, Fujitsu Laboratories,
Hitachi, JUSTSYSTEM, Kanagawa Univ. (2), KAIST/KORTERM, Manchester
Metropolitan Univ., Matsushita Electric Industrial, NACSIS, National Taiwan
Univ., NEC (2 groups), NTT, RMIT & CSIRO, Tokyo Univ. of Technology,
Toshiba, Toyohashi Univ. of Technology, Univ. of California Berkeley, Univ. of
Lib. and Inf. Science (Tsukuba, Japan), Univ. of Maryland, Univ. of Tokushima,
Univ. of Tokyo, Univ. of Tsukuba, Yokohama National Univ., Waseda Univ.

NTCIR Workshop 2. Forty-six groups from eight countries registered for the second 
NTCIR Workshop. Among them, 16 registered for Chinese IR, 30 for Japanese- 
English IR tasks, and 15 for Text Summarization. 

ATT Labs & Duke Univ., Chinese Univ. of Hong Kong, Communications
Research Laboratory (MPT), Fuji Xerox, Fujitsu Laboratories (2), Gifu Univ.,
Hitachi Co., Hong Kong Polytechnic, loS, Johns Hopkins Univ., JR Res. Labs,
JUSTSYSTEM, Kanagawa University, KAIST/KORTERM, Matsushita Electric
Industrial, Nat. Tsing Hua Univ., Univ. of Osaka, NII (3), Univ. of Tokyo (2),
NEC, New Mexico Univ., NTT & NAIST, OASIS, Queens College-City
University of New York, Ricoh Co., Surugadai Univ.,
Toshiba/Cambridge/Microsoft, Trans EZ, Toyohashi Univ. of Technology (2),
Univ. of California Berkeley, Univ. of Electro-Communication (2), Univ. of
Exeter, Univ. of Lib. and Inf. Science (Tsukuba, Japan), Univ. of Maryland,
Univ. of Montreal, Yokohama National Univ. (2), Waseda Univ.



2.3 Procedures and Evaluation 

NTCIR Workshop 1:

• November 1, 1998: distribution of the training data (document data, 30 ad hoc
  topics, 21 cross-lingual topics and their relevance assessments)
• February 8, 1999: distribution of the test data (the 53 new test topics)
• March 4, 1999: submission of results
• June 12, 1999: distribution of evaluation results
• August 30-September 1, 1999: Workshop meeting

NTCIR Workshop 2:

• June, 2000: distribution of the training data
• August 10, 2000: distribution of the test data for the Japanese IR task (new
  documents and 49 J/E topics)
• August 30, 2000: distribution of the test data for the Chinese IR task (new
  documents and 50 C/E topics)
• September 8, 2000: dry run in the Summarization task
• September 18, 2000: submission of results in the Japanese IR task
• October 20, 2000: submission of results in the Chinese IR task
• November, 2000: test in the Summarization task
• January 10, 2001: distribution of evaluation results
• March 7-9, 2001: Workshop meeting at the NII in Tokyo.

A participant could submit the results of more than one run. For IR tasks, both
automatic and manual query construction were allowed. In the case of automatic
construction, participants had to submit, as a mandatory run, at least one set of
search results obtained using only the <Description> fields of the topics. The
intention was to facilitate cross-system comparison. For optional automatic runs and
manual runs, any fields of the topics could be used. In addition, each participant had
to complete and submit a system description form describing the detailed features of
the system.

Human analysts assessed the relevance of retrieved documents to each topic. The 
relevance judgments (right answers) for the test topics were delivered to active 
participants who submitted search results. Based on these assessments, interpolated
recall and precision at 11 points, average precision (non-interpolated) over all relevant
documents, and precision at 5, 10, 15, 20, 30, and 100 documents were calculated
using TREC's evaluation program, which is available from the ftp site of Cornell 
University. 
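
The measures listed above can be illustrated with a minimal sketch. This is not TREC's evaluation program itself; the function names and the toy document IDs are assumptions made for illustration, and relevance is treated as binary.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Non-interpolated average precision over all relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant) if relevant else 0.0


def interpolated_precision_11pt(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) observed at each relevant hit
    hits = 0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    result = []
    for i in range(11):
        level = i / 10
        # Interpolation: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        result.append(max(candidates, default=0.0))
    return result


# Hypothetical ranked list and relevance set for one topic.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
```

Relevant documents that are never retrieved (here "d6") still count in the denominator of average precision, which is why the measure is said to be taken over all relevant documents.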

For the Text Summarization Task, both intrinsic and extrinsic evaluations have 
been conducted. For the former, emphasis is placed on round-table evaluation and 
creating a reusable data set. Professional captionists created two kinds of summaries as
"right answers": abstract-type summaries, which involved the reorganisation of
sentences, and extract-type summaries. Each submitted summary was then rated by
these professional captionists, who compared it with the two "right answers" and with
an automatically created random summary of the article. The results will serve as
reference data for the round-table discussion at the workshop meeting, where all the 
participants share the experience and can have detailed discussion of the technology. 
For the extrinsic evaluation, we chose an IR task based evaluation, which is similar to 
the method used at SUMMAC [6]. 



2.4 Results of the NTCIR Workshop 1 and Discussion 

Recall/precision (R/P) graphs of the top Ad Hoc and top Cross-Lingual runs are
shown in Figs. 1 and 2. For further details of each approach, please consult
the paper for each system in the Workshop Proceedings, which is available online at 
http://www.nii.ac.jp/ntcir/OnlineProceedings/. 

One of the most interesting things found in the IR evaluation is that among the best 
systems, the two systems of JSCB and BK, which took completely different 
approaches, both obtained very high scores. JSCB made very effective use of NLP
techniques on a vector space model with pseudo relevance feedback, whereas BKJJBIFU
focused on a statistical approach: weighting algorithms based on long experience with
the expanding probabilistic model using logistic regression, combined with simple
bi-gram segmentation.
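
As an illustration of what simple bi-gram segmentation can look like for text without explicit word boundaries, here is a minimal sketch. The text does not describe the actual segmenter in more detail, so the function name and its handling of whitespace and very short strings are assumptions.

```python
def char_bigrams(text):
    """Split a string into overlapping character bi-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own indexing unit.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Example: a four-character Japanese string yields three overlapping bi-grams.
example = char_bigrams("情報検索")
```

Because every adjacent pair of characters becomes an indexing unit, no dictionary or morphological analyser is required, at the cost of a larger index.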

Many groups used weighting schemes that have been reported to work well on
English documents but had not previously been tested on Japanese documents. This is
probably because of the shortness of time in the Workshop schedule. Further
experiments on these weighting schemes are expected.

Quasi-paired documents of a native language and English such as the ones 
included in the NTCIR-1 & 2 collections can be easily found in the real world, for 
example, on the Web, or in scholarly documents, commercial documents describing a 
company's products, government documents, and so on. Using these documents to 
prepare bilingual or multilingual lexical resources that are usable for cross-lingual 
information access is a practical approach to the problem. 

Fig. 1. Top Ad Hoc Runs (Level 1)

Fig. 2. Top Cross-Lingual Runs (Level 1)

Transliteration: In the NTCIR Workshop 1, one group used transliteration of
Katakana (phonetic characters used to represent foreign terms) terms in CLIR, which 
worked well. It seemed to work especially well on technical terms and is expected to 
be effective in reducing the problems caused by word mismatch because of the 
various ways of expression of a concept in Japanese documents, as discussed above. 
More investigation is expected on this matter. 

Round-table Evaluation: We conducted the Term Recognition Task as a round- 
table evaluation. The organiser prepared evaluation results according to a proposed 
evaluation measure called "most common answer" for each submitted result and these 
were used as reference data for the round-table discussion. For term recognition, there 
can be various directions of evaluation criteria according to the purpose of the 
research and application. A single "gold standard" cannot be meaningful for this task. 
Instead, we placed emphasis on sharing ideas on "what is the problem in term 
recognition research", and detailed discussions on the techniques used and their 
purpose. We then discussed further directions for this investigation based on the 
common experience gained through the task at the workshop. 



3 Test Collections 

Through the NTCIR Workshops and its ex-partner (now colleague of NTCIR) IREX,
the following test collections or data sets usable for laboratory-type testing of IR and
related text processing technology were constructed.

□ CHIB01; more than 130,000 Chinese articles from 5 Taiwan newspapers of 1998
and 1999. 50 Chinese topics and English translation, 4-grade relevance judgments

□ NTCIR-1; ca. 330,000 Japanese and English documents. 83 Japanese topics, 3-grade
relevance judgments. A tagged corpus

□ NTCIR-2; ca. 400,000 Japanese and English documents, 49 Japanese topics and
English translation, 4-grade relevance judgments. Segmented data

□ NTCIR-2 Summ; ca. 100 + ca. 2000 (NTCIR-2 TAO Summ) manually created
summaries of various types of Japanese articles from Mainichi Newspaper of 1994,
1995 and 1998.

□ IREX-IR; ca. 200,000 Japanese newspaper articles from Mainichi Newspaper of
1994 and 1995, 30 Japanese topics, 3-grade judgments

□ IREX-NE; Named entity extraction from Japanese newspaper articles

A sample document record of the NTCIR-1 is shown in Fig. 3. The documents are 
author abstracts of conference papers presented at academic meetings hosted by 65 
Japanese academic societies. More than half of them are English-Japanese paired. 
Documents are plain texts with SGML-like tags. A record may contain document ID, 
title, a list of author(s), name and date of the conference, abstract, keyword(s) that 
were assigned by the author(s) of the document, and the name of the host society. 

A sample topic record is shown in Fig. 4. Topics are defined as statements of "user
needs" rather than "queries", which are the strings actually submitted to the system,
since we would like to allow both manual and automatic query construction from the 
topics. 

<REC>
<ACCN>gakkai-0000011144</ACCN>
<TITL TYPE="kanji">[Japanese text]</TITL>
<TITE TYPE="alpha">Electronic manuscripts, electronic publishing, and electronic library</TITE>
<AUPK TYPE="kanji">[Japanese text]</AUPK>
<AUPE TYPE="alpha">Negishi, Masamitsu</AUPE>
<CONF TYPE="kanji">[Japanese text]</CONF>
<CNFE TYPE="alpha">The Special Interest Group Notes of IPSJ</CNFE>
<CNFD>1991. 11. 19</CNFD>
<ABST TYPE="kanji"><ABST.P>[Japanese text]</ABST.P></ABST>
<ABSE TYPE="alpha"><ABSE.P>Current situation on electronic processing in preparation, editing,
printing, and distribution of documents is summarized and its future trend is discussed, with focus on the
concept: "Electronic publishing: Movements in the country concerning an international standard for
electronic publishing. Standard Generalized Markup Language (SGML) is assumed to be important, and
the results from an experiment at NACSIS to publish an "SGML Experimental Journal" and to make its
full-text CD-ROM version are reported. Various forms of "Electronic Library" are also investigated. The
author puts emphasis on standardization, as technological problems for those social systems based on the
cultural settings of publication of the country, are the problems of acceptance and penetration of the
technology in the society.</ABSE.P></ABSE>
<KYWD TYPE="kanji">[Japanese text]</KYWD>
<KYWE TYPE="alpha">Electronic publishing // Electronic library // Electronic manuscripts // SGML //
NACSIS // Full text databases</KYWE>
<SOCN TYPE="kanji">[Japanese text]</SOCN>
<SOCE TYPE="alpha">Information Processing Society of Japan</SOCE>
</REC>

Fig. 3. Sample Document Record in the NTCIR-1.

A topic contains SGML-like tags and consists of a title, a description, a detailed 
narrative, and a list of concepts and field(s). The title is a very short description of the 
topic and can be used as a very short query that resembles those often submitted by 
end-users of Internet search engines. Each narrative may contain a detailed 
explanation of the topic, term definitions, background knowledge, the purpose of the 
search, criteria for judgment of relevance, and so on. 



3.1 Relevance Judgments (Right Answers) 

The relevance judgments were undertaken using the pooling method. Assessors and
topic authors are always users of the document genre. The relevance judgments were
made on a multi-grade scale: three grades in NTCIR-1 and four grades in
NTCIR-2 and CHIB01. We think that multi-grade relevance judgments are more
natural, or closer to the judgments made in real life. To run TREC's evaluation
program to calculate mean average precision, recall-level precision and
document-level precision, we set two thresholds for the level of relevance.

<TOPIC q=0005>
<TITLE>
[Japanese text]
</TITLE>
<DESCRIPTION>
[Japanese text]
</DESCRIPTION>
<NARRATIVE>
[Japanese text]
</NARRATIVE>
<CONCEPT>
[Japanese text]
</CONCEPT>
<FIELD>
[Japanese text]
</FIELD>
</TOPIC>

Fig. 4. Sample Topic Record in the NTCIR-1



For NTCIR-1 and 2, the assessors are researchers in each subject domain, since
these collections contain scientific documents; two assessors judged the relevance to a
topic separately and assigned one of the three or four degrees of relevance. After
cross-checking, the primary assessor of the topic, who created the topic, made the final
judgment. TREC's evaluation program was run against two different lists of
relevant documents produced by two different thresholds of relevance: Level 1,
in which "highly relevant (S)" and "relevant (A)" are rated as "relevant", and Level 2,
in which S, A and "partially relevant (B)" are rated as "relevant". Note that NTCIR-1
does not contain "highly relevant (S)".
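
The two thresholds can be sketched as a mapping from graded judgments to the binary relevance lists that an evaluation program expects. The grade letters S, A and B follow the text; the label "C" for irrelevant documents, the dictionary layout and the function name are assumptions made for illustration.

```python
# Grades accepted as "relevant" at each threshold (S, A, B as in the text).
LEVELS = {
    "level1": {"S", "A"},       # highly relevant, relevant
    "level2": {"S", "A", "B"},  # additionally: partially relevant
}


def binarize(judgments, level):
    """Return the set of document IDs counted as relevant at the given level."""
    accepted = LEVELS[level]
    return {doc for doc, grade in judgments.items() if grade in accepted}


# Hypothetical graded judgments for one topic ("C" = irrelevant, an assumed label).
judgments = {"d1": "S", "d2": "A", "d3": "B", "d4": "C"}
level1_relevant = binarize(judgments, "level1")
level2_relevant = binarize(judgments, "level2")
```

Running the same evaluation twice, once per list, yields the Level 1 and Level 2 scores.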

Relevance judgments in CHIB01 were conducted according to the method
originally proposed by Lin and her supervisor Kuang-hua Chen, who is one of the
organisers of the Chinese IR Task at the NTCIR Workshop 2 [7]. Three different
groups of users (information specialists including librarians, subject specialists, and
ordinary people) conducted judgments separately and assigned to each document one
of four different degrees of relevance: very relevant (3), relevant (2), partially relevant
(1) and irrelevant (0). Then, the three relevance judgments assigned by the assessors
were averaged and normalised to a value between 0 and 1 using the formula below:

(Assessor1 + Assessor2 + Assessor3) / 3 / 3

The so-called rigid relevance means that the final relevance should be between 0.6667
and 1. This is equivalent to the assessors assigning, on average, "relevant (2)" or
higher to the document, and corresponds to Level 1 in NTCIR-2. The so-called relaxed
relevance means that the final relevance should be between 0.3333 and 1. That is to
say, it is equivalent to the assessors assigning, on average, "partially relevant (1)" or
higher to the document, and corresponds to Level 2 in NTCIR-2. TREC's evaluation
program was run against these two levels of relevance.
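
The averaging formula and the two thresholds can be written down as a short sketch. The function and constant names are assumptions; the thresholds 0.6667 and 0.3333 from the text are represented as 2/3 and 1/3, with a small tolerance for rounding.

```python
RIGID = 2 / 3    # "rigid" threshold, quoted as 0.6667 in the text
RELAXED = 1 / 3  # "relaxed" threshold, quoted as 0.3333 in the text


def final_relevance(grades, max_grade=3):
    """Average the assessors' grades (0-3) and scale the result to [0, 1]."""
    return sum(grades) / len(grades) / max_grade


def is_relevant(grades, threshold):
    """True if the averaged relevance reaches the threshold (with rounding slack)."""
    return final_relevance(grades) >= threshold - 1e-4


# Three assessors who all say "relevant (2)" reach the rigid threshold exactly.
all_relevant = is_relevant([2, 2, 2], RIGID)
# Because grades are averaged, (3, 3, 0) also passes although one assessor
# judged the document irrelevant.
mixed = is_relevant([3, 3, 0], RIGID)
```

The second example shows that the thresholds apply to the average grade rather than to each assessor individually.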

Three different groups of users were employed as assessors because the genre of
newspaper articles is used by various kinds of users. The idea of
averaging the assessments of different user groups is new compared to the
traditional approach of test collection building, in which the topic author should be the
most qualified assessor. A similar idea was mentioned by Dr Andrei Broder, Vice
President for Research and Chief Scientist at Alta Vista, in his invited talk at the
TREC-9 Conference held on 13-16 November 2000. He proposed the need to
average the relevance judgments of 15 to 20 users in the evaluation of Web search
engines, since the users of the systems are very heterogeneous and systems cannot
know the user's profile during the search.

Additional Information: In NTCIR-1 and 2, relevance judgment files contain not
only the relevance of each document in the pool but also extracted phrases or
passages showing the reason why the analyst assessed the document as "relevant".
Situation-oriented relevance judgments were conducted based on the statement of
"purpose of search" or "background" in <NARRATIVE> in each topic, as well as
topic-oriented relevance judgments, which are more common in ordinary laboratory
testing of IR systems. However, only topic-oriented judgments are used in the formal
evaluation of this Workshop.

Rank-Degree Sensitive Evaluation Metric on Multi-grade Relevance Judgments: In
the NTCIR Workshop 2, we plan to use a metric that is sensitive to the degree of
relevance of the documents and to their rank in the ranked list of retrieved
documents. Intuitively, highly relevant documents are more important for users
than partially relevant ones, and documents retrieved at higher ranks in the
ranked list are more important. Therefore, systems producing search results in
which more highly relevant documents appear at higher ranks should be rated as
better.

Multi-grade relevance judgments are used in several test collections such as Cystic
Fibrosis [8] and OHSUMED [9], though specific evaluation metrics for them were not
produced for these collections. We are now examining several rank-degree sensitive
metrics proposed so far, including Average Search Length [10], Relative Relevance
and Ranked Half-Life [11], and Cumulated Gains [12], and will then choose or
propose appropriate measures for our purpose.
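
To make the idea of a rank-degree sensitive metric concrete, the sketch below computes cumulated gain and its discounted variant over a ranked list of graded judgments. It illustrates only one of the candidate families named above, not the measure finally adopted for the Workshop; the numeric gain values and the logarithm base are assumptions.

```python
import math


def cumulated_gain(gains):
    """Running sum of relevance grades down the ranked list."""
    cg, total = [], 0.0
    for g in gains:
        total += g
        cg.append(total)
    return cg


def discounted_cumulated_gain(gains, base=2):
    """Cumulated gain in which documents at lower ranks contribute less.

    Ranks below `base` are not discounted; from rank `base` onwards the
    gain is divided by log_base(rank).
    """
    dcg, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < base else g / math.log(rank, base)
        dcg.append(total)
    return dcg


# Hypothetical grades of the top four retrieved documents (3 = highly relevant).
gains = [3, 2, 0, 1]
cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
```

A run that places the grade-3 document first scores higher than one that places it fourth, which is the behaviour the paragraph above asks for.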



3.2 Linguistic Analysis 

NTCIR-1 contains a "Tagged Corpus", which provides detailed hand-assigned
part-of-speech (POS) tags for 2,000 Japanese documents selected from NTCIR-1. Spelling
errors are also manually collected. Because of the absence of explicit boundaries
between words in Japanese sentences, we set three levels of lexical boundaries (i.e.,
word boundaries, and strong and weak morpheme boundaries). In NTCIR-2, the
segmented data of the whole J (Japanese document) Collection is provided. The
documents are segmented into three levels of lexical boundaries using a commercially
available morphological analyser called HAPPINESS.



3.3 Robustness of the IR System Testing Using NTCIR-1

The test collection NTCIR-1 has been tested from the following aspects so that it can 
be used as a reliable tool for IR system testing: 

(A) exhaustivity of the document pool 

(B) inter-analyst consistency and its effect for system evaluation 

(C) topic-by-topic evaluation. 

The results of these studies have been reported and published on various occasions 
[13-16]. As a result, in terms of exhaustiveness, pooling the top 100 documents from 
each run worked well for topics with fewer than 50 relevant documents. For topics 
with more than 100 relevant documents, although the top 100 pooling covered only 
51.9% of the total relevant documents, coverage was higher than 90% if combined 
with additional interactive searches. Therefore, we decided to use the top 100 pooling 
and conducted additional interactive searches for topics with more than 50 relevant 
documents. 
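
The pooling procedure and the coverage figure discussed above can be sketched as follows. The run names, document IDs and function names are hypothetical; only the idea of taking the top 100 documents from each run comes from the text.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from every submitted run."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


def pool_coverage(pool, relevant):
    """Share of the known relevant documents that the pool contains."""
    return len(pool & relevant) / len(relevant) if relevant else 1.0


# Toy example with a pooling depth of 2 instead of 100.
runs = {"runA": ["d1", "d2", "d3"], "runB": ["d2", "d4", "d5"]}
pool = build_pool(runs, depth=2)
coverage = pool_coverage(pool, {"d1", "d5"})
```

Documents found by additional interactive searches would simply be added to the pool before assessment, which is how coverage can rise for topics with many relevant documents.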

We found a strong correlation between the system rankings produced using
different relevance judgments and different pooling methods, despite the
inconsistency of the relevance assessments among analysts
[13-15]. A similar analysis has been reported by Voorhees
[17]. We concluded that NTCIR-1 is reliable as a tool for system evaluation based on 
these analyses. 
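
One common way to compare two system rankings is a rank correlation coefficient. The text does not state which coefficient was used, so the choice of Kendall's tau below is an assumption, as are the system names and scores.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau (tau-a) between two score tables over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1  # both judgment sets order the pair the same way
        elif sign < 0:
            discordant += 1  # the two judgment sets disagree on the pair
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 1.0


# Hypothetical mean average precision of three systems under two sets
# of relevance judgments.
judgments_1 = {"s1": 0.40, "s2": 0.35, "s3": 0.20}
judgments_2 = {"s1": 0.45, "s2": 0.30, "s3": 0.25}
judgments_3 = {"s1": 0.30, "s2": 0.35, "s3": 0.20}
same_order = kendall_tau(judgments_1, judgments_2)
one_swap = kendall_tau(judgments_1, judgments_3)
```

A value close to 1 means the two sets of judgments rank the systems almost identically, even if the absolute scores differ.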



4 Future Directions 

In the future, we would like to extend the investigation in the following directions:

1. Evaluation of CLIR systems including Asian languages
2. Evaluation of retrieval of new document genres
3. Evaluation of technology to make retrieved information immediately usable

One of the problems of CLIR is the availability of resources that can be used for
translation [18-19]. Encouraging the creation and sharing of such resources is
important. In the NTCIR Workshops, some groups automatically constructed a bilingual
lexicon from a quasi-paired document collection. We ourselves also conducted research
on CLIR using automatically generated bilingual keyword clusters based on graph theory
[20]. Such paired documents can be easily found in non-English speaking countries
and on the Web. Studying algorithms to construct such resources and sharing
them is one practical way to enrich the applicability of CLIR. International
collaboration is needed to construct multilingual test collections and to organise
the evaluation of CLIR, since creating topics and making relevance judgments are
language- and culture-dependent and must be done by native speakers. With respect to
new genres, we are especially interested in Web documents and multimedia documents.
For these document types, the user group, usage, purpose of search, and criteria for
successful retrieval are quite different from those for traditional text retrieval, and
the investigation of these aspects is challenging.



Abstract. Language resources such as machine dictionaries and lexical
databases, aligned parallel corpora or even complete machine transla- 
tion systems are essential in Cross-Language Text Retrieval (CLTR), al- 
though not standard tools for the Information Retrieval task in general. 
We outline the current use and adequacy for CLTR of such resources, 
focusing on the participants and experiments performed in the CLEF 
2000 evaluation. Our discussion is based on a survey conducted on the 
CLEF participants, as well as the descriptions of their systems that can 
be found in the present volume. We also discuss how the usefulness of the 
CLEF evaluation campaign could be enhanced by including additional 
tasks which would make it possible to distinguish between the effect on 
the results of the resources used by the participating systems, on the one 
hand, and the retrieval strategies employed, on the other. 



1 Introduction 

Broadly speaking, traditional Information Retrieval (IR) has paid little attention
to the linguistic nature of texts, keeping the task closer to a string processing
approach rather than a Natural Language Processing (NLP) one. Tokenization,
removal of non-content words and crude stemming are the most
“language-oriented” IR tasks. So far, more sophisticated approaches to indexing
and retrieval (e.g. phrase indexing, semantic expansion of queries, etc.) have
generally failed to produce the improvements that would compensate for their
higher computational cost. As a consequence, the role of language resources in
standard text retrieval systems has remained marginal.

The Cross-Language Information Retrieval (CLIR) challenge - in which que- 
ries and documents are stated in different languages - is changing this landscape: 
the indexing spaces of queries and documents are different, and the relationships 
between them cannot be captured without reference to cross-linguality. This 
means that Language Engineering becomes an essential part of the retrieval pro- 
cess. As the present volume attests, research activities in CLIR include the devel- 
opment, adaptation and merging of translation resources; the study of methods 
to restrict candidate terms in query translation; the use of Machine Transla- 
tion (MT) systems, in isolation or (more commonly) in combination with other 
strategies, etc. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 36-El 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 



Language Resources in Cross-Language Text Retrieval: A CLEF Perspective 






In this paper, we will study the use of Language Resources by groups par- 
ticipating in CLEF 2000, assuming that this provides a representative snapshot 
of the research being conducted in CLIR as a whole. We will use “language re- 
sources” in its broadest sense to include not only dictionaries and corpora but 
also Natural Language Processing tools (stemmers, morphological analyzers and 
compound splitters, MT systems, etc.). 

The next section summarizes the language resources, and their current ca- 
pabilities and shortcomings, used in the first CLEF campaign. In Section 3 we 
propose possible ways to complement the current CLEF evaluation activity to 
take into account the balance between the quality of language resources, on one 
hand, and cross-language retrieval techniques, on the other. The final section 
briefly extracts some conclusions. 

2 Language Resources in CLEF 2000 

We have collected information about the language resources and tools employed 
in the first CLEF campaign, using two sources of information: a survey conducted 
on the CLEF participants, and the papers contained in the present volume. 

The survey was sent to all participants in CLEF, and we received 14 re- 
sponses. The teams were asked to list the resources used (or tested) in their 
CLTR system, specifying the provider, the availability and the approximate 
size/coverage of the resource. They were also asked a) whether the resources 
were adapted/enriched for the experiment, and how; b) what were the strengths 
and limitations of the resources employed; and c) their opinion about key is- 
sues for future CLTR resources. Finally, we scanned the descriptions of systems 
contained in the present volume to complete the information obtained in the 
responses to the survey. 

We have organized language resources into three groups: dictionaries (from 
bilingual word pair lists to lexical knowledge bases), aligned corpora (from the 
Hansard corpus to data mined from the web) and NLP software (mainly MT sys- 
tems, stemmers and morphological analyzers). Before discussing in more depth 
each of these three categories, some general observations can be made: 

— More than 40% of the resources listed have been developed by the partici- 
pants in the CLIR evaluation. This is a strong indication that CLEF is not 
just evaluating CLIR strategies built on top of standard resources, but also 
evaluating resources themselves. 

— Only 5 out of 34 resources are used by more than one group: a free dictionary
(Freedict), a web-mined corpus (WAC), an online MT service
(Babelfish), a set of stemmers (Muscat) and an automatic morphology
system (Automorphology). This is partially explained by the fact that
many participants use their own resources, and there are only two cases
of effective resource sharing: the web-mined corpus developed by U
Montreal/RALI (three users including the developers) and the Automorphology
system developed by the U. of Chicago (used also by the U. Maryland group

Languages                developer / provider      size                                   teams
EN,GE,FR,IT              IAI                       EN 40K, GE 42K, FR 33K, IT 28K         IAI
EN-GE,FR,IT              IAI                       EN/FR 39K, EN/GE 46K, EN/IT 28K        IAI
NL-EN                    Canadian web company      ?                                      Syracuse U.
NL-EN,GE,FR,IT           www.travlang.com/Ergane   NL 56K, EN 16K, FR 10K, GE 14K, IT 4K  CWI, U Montreal/RALI
EN-GE                    www.quickdic.de           EN 99K, GE 130K                        U. Maryland
EN-FR                    www.freedict.com          EN 20K, FR 35K                         U. Maryland, U. Glasgow
EN-IT                    www.freedict.com          EN 13K, IT 17K                         U. Maryland, U. Glasgow
EN-GE                    www.freedict.com          88K                                    IRIT
EN-GE                    www.leo.online            224K                                   U. Dortmund
FI,SW,GE->EN             ?                         100K                                   U. Tampere
GE-EN                    ?                         ?                                      Eurospider
EN-FR                    Termium                   1M per lang.                           U. Montreal/RALI
GE-FR,IT                 Eurospider sim. thesauri  ?                                      Eurospider
GE-EN-SP-NL-IT-FR-CZ-ET  EuroWordNet/ELRA          EN 168K, IT 48K, GE 20K, FR 32K        U. Sheffield
EN/GE/NL                 CELEX/LDC                 51K lemmas                             U. Sheffield
NL-GE,FR,EN,SP           VLIS/Van Dale             100K lemmas                            TNO/Twente

Table 1. Dictionaries and lexical databases



— The coverage and quality of the resources are very different. In general, 
the participating teams found that good resources (in coverage, consistency, 
markup reliability, translation precision, richness of contextual information) 
are expensive, and free resources are of poor quality. With a few (remarkable) 
exceptions, better resources seem to lead to better results. 

— Of all the “key issues for the future”, the one quoted most often by CLEF 
participants was simply “availability” and sharing of lexical resources. This 
is partially explained by the points mentioned above: 

• many resources used in CLEF are developed by the participants them- 
selves, and it is not clear whether they are accessible to other researchers 
or not, except for a few cases. 

• a general claim is that good resources (especially dictionaries) are ex- 
pensive, and freely available dictionaries are poor. 

• the diversity and minimal overlap of the resources used by CLEF
participants indicate a lack of awareness of which resources are available
and what their cost/benefit is for CLIR tasks. The CLEF
activities should provide an excellent forum to overcome many of these
difficulties.

— Two trends seem to be consolidating: 

• The lack of parallel corpora is being overcome, in corpus-based
approaches, either by mining the web (U Montreal/RALI) or by using
comparable corpora (Eurospider [12]).

• The distinction between corpus-based and dictionary-based approaches
is becoming less useful to classify CLIR systems, as they tend to merge
whatever resources are available. The U Montreal/RALI, Eurospider,
TNO/Twente and IRIT systems are examples of this tendency.



2.1 Dictionaries 

It is easy to imagine the features of an ideal dictionary for CLIR: wide cover- 
age and high quality, extensive information to translate phrasal terms, trans- 
lation probabilities, domain labels, rich examples of usage to permit contex- 
tual disambiguation, domain-specific extensions with coverage of named enti- 
ties, semantically-related terms, clean markup ... In general, such properties are 
listed by CLEF participants as features that are lacking in present resources and 
desirable features for future CLIR resources. 

In practice, 14 different lexical resources were used by the 18 groups partici- 
pating in CLEF this year (see Table 1). They are easier to obtain and use than 
aligned corpora and thus their use is more generalized. The distinctive feature 
of the dictionaries used in CLEF is their variety: 

— Under the term “dictionary” we find a whole range of lexical resources, from 
simple lists of bilingual word pairs to multilingual semantic databases such 
as EuroWordNet. 

— In most cases, however, the lexical knowledge effectively used by the CLEF 
systems is quite simple. Definitions, domain labels, examples of usage, se- 
mantically related terms, are examples of lexical information that are hardly 
used by CLEF participants. Information on translation probabilities, on the 
other hand, is something that the dictionaries did not provide and would 
have been used by many teams, according to the survey. 

— The size of the dictionaries used also covers a wide spectrum: from the 4000
terms in the Italian part of the Ergane dictionary to the 1 million terms
per language in the Termium database used by the U Montreal/RALI
group. These sizes differ by more than two orders of magnitude.

— Some of them (four at least) are freely available on the web; two are
obtainable via ELRA (European Language Resources Association) or LDC
(Linguistic Data Consortium); one is distributed by a publishing company
(Van Dale) and at least three have a restricted distribution.

— Only one dictionary is used by more than one group (Freedict in its
English-French and English-Italian versions). As has already been pointed out, this
is a strong indication that sharing resources/knowledge about resources is
not yet a standard practice in the CLIR community.

— As could be expected, the more expensive the resource, the higher its quality 
and coverage and the better the results, in the opinion of the participants. 
Freely available dictionaries tend to be the most simple and noisy, and have 
lower coverage. 

Table 1 does not include the GIRT thesaurus, which was provided to all
participants in the specific-domain retrieval task. UC Berkeley, for instance,
used this social sciences bilingual thesaurus to produce a domain specific
translation list; the list was used, together with a generic bilingual dictionary for
uncovered words, to produce better results than an MT approach. This is an 
interesting result that shows that, although thesauri are not considered as lexi- 
cal resources per se, they can be successfully adapted for translation purposes. 
The similarity thesaurus included in Table 1 was derived automatically from 
comparable corpora (see below). 
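
The combination described above for UC Berkeley (a domain-specific translation list, backed by a generic dictionary for uncovered words) can be sketched as a simple two-level lookup. This is only an illustration under stated assumptions: the function name, the data layout and the toy entries are hypothetical and are not taken from the Berkeley system.

```python
def translate_query(terms, domain_list, generic_dict):
    """Translate query terms, preferring the domain-specific list.

    Both resources are assumed to map a lower-cased source term to a
    list of target-language translations (hypothetical layout).
    """
    translated = []
    for term in terms:
        key = term.lower()
        if key in domain_list:
            # Domain-specific translation takes priority.
            translated.extend(domain_list[key])
        elif key in generic_dict:
            # Fall back to the generic bilingual dictionary.
            translated.extend(generic_dict[key])
        else:
            # Uncovered word: keep it untranslated (e.g. a proper noun).
            translated.append(term)
    return translated


# Toy resources, invented for illustration only.
domain_list = {"absatzpolitik": ["sales policy"],
               "arbeitsmarkt": ["labour market"]}
generic_dict = {"kinder": ["children", "kids"],
                "arbeitsmarkt": ["job market"]}

result = translate_query(["Absatzpolitik", "Kinder", "Konstanz"],
                         domain_list, generic_dict)
```

Keeping untranslated terms in the query is one possible design choice; a real system might instead drop them or try transliteration.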



2.2 Aligned Corpora 

Only five aligned corpora were used by CLEF participants, mainly by the JHU/APL
group (see Table 2). Most of them are domain-specific (e.g. the Hansard corpus
or the United Nations corpus) and not particularly well suited to the CLEF
data. Obviously, the lack of aligned corpora is a major problem for
corpus-based approaches. However, mining parallel web pages seems a promising
research direction: the corpora and the mining software developed by
U Montreal/RALI and made freely available to CLEF participants have been used
by more groups than any other resource (U Montreal/RALI, JHU/APL, IRIT,
TNO/Twente).



Resource          Languages       Developer/provider  Size               Teams
WAC (web corpus)  FR,EN,IT,GE     U. Montreal/RALI    100MB per lang.    U. Montreal/RALI, JHU/APL, IRIT
web corpus        EN/NL           TNO/Twente          3K pages           TNO/Twente
Hansard           EN-FR           LDC                 3M sentence pairs  JHU/APL
UN                EN-SP-FR        LDC                 50K EN-SP-FR docs  JHU/APL
JOC               EN-FR-SP-IT-GE  ELRA                10K sentences      JHU/APL



Table 2. Aligned Corpora 

Resource                        Languages       Developer/provider       Teams
babelfish.altavista.com         EN,FR,GE,IT,SP  Altavista/Systran        U. Dortmund, U. Salamanca, U.C. Berkeley
Systran MT system               EN-FR,GE,IT     Systran                  JHU/APL
L&H Power Translator Pro 7.0    EN-FR,GE,IT     Lernout & Hauspie        U.C. Berkeley

stemmers                        EN,GE,FR,IT,NL  open.muscat.com          CWI, West Group
stemmers (from assoc. dic.)     IT,FR,GE        U.C. Berkeley            U.C. Berkeley
ZPRISE stemmers                 FR,GE           NIST                     U. Glasgow
stat. stemmer                   FR,GE,IT,EN     U. Chicago, U. Maryland  U. Maryland
Spider stemmers                 FR,IT,GE        Eurospider               Eurospider
Automorphology                  EN,GE,IT,FR     U. Chicago               U. Chicago, U. Maryland
morph. analyser                 FIN,GE,SWE,EN   LINGSOFT                 U. Tampere
compound splitter               NL              Twente                   CWI/Twente
MPRO morph. anal.               GE              IAI                      IAI
stemmers based on morph. anal.  FR,GE           ?                        West Group
morph. analyser/POS tagger      IT              ITC-IRST                 ITC-IRST
grammars                        EN,IT,GE,FR     IAI                      IAI







Table 3. NLP software 



Besides parallel corpora, a German/Italian/French comparable corpus
consisting of Swiss national newswire, provided by SDA (Schweizerische
Depeschenagentur), was used to produce a multilingual similarity thesaurus
[12]. The performance of this thesaurus and the availability of comparable
corpora (much easier to obtain, in theory, than parallel corpora) make such
techniques worth pursuing.

Overall, it becomes clear that corpus-based approaches offer two advantages 
over dictionaries: a) they make it possible to obtain translation probabilities 
and contextual information, which are rarely present in dictionaries, and b) 
they would provide translations adapted to the searching domain, if adequate
corpora were available. The practical situation, however, is that aligned transla- 
tion equivalent corpora are not widely available, and are very costly to produce. 
Mining the web to construct bilingual corpora and using comparable corpora 
appear to be promising ways to overcome such difficulties, according to CLEF 
results. 
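
To make the first advantage concrete, the sketch below derives rough translation probabilities from sentence-aligned text by counting co-occurrences. It is a deliberately naive estimate under stated assumptions (whitespace tokenisation, relative co-occurrence frequency, invented toy sentences); it is not the method of any CLEF participant, and real systems would typically use a proper alignment model.

```python
from collections import Counter, defaultdict


def cooccurrence_translation_probs(aligned_pairs):
    """Estimate P(target word | source word) from aligned sentence pairs.

    Each source word is credited with every target word occurring in the
    aligned sentence; counts are then normalised per source word.
    """
    counts = defaultdict(Counter)
    for src_sentence, tgt_sentence in aligned_pairs:
        tgt_words = tgt_sentence.lower().split()
        for src_word in set(src_sentence.lower().split()):
            counts[src_word].update(tgt_words)

    probs = {}
    for src_word, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        probs[src_word] = {t: n / total for t, n in tgt_counts.items()}
    return probs


# Toy aligned corpus, invented for illustration only.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
probs = cooccurrence_translation_probs(pairs)
best_for_house = max(probs["house"], key=probs["house"].get)
```

Even on three sentence pairs the most frequent co-occurring word for "house" is "maison", which is the kind of ranked, weighted evidence that a plain bilingual dictionary does not offer.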



2.3 NLP Software 

Stemmers, morphological analyzers and MT systems have been widely used by 
the participants. The list of tools can be seen in Table 3. Some results are worth 
pointing out: 

— The best groups in the German monolingual retrieval task all did some kind 
of compound analysis, confirming that morphological information (beyond 
crude stemming) may be crucial for languages with a rich morphology. Vari- 
ants of the Porter stemmer for languages other than English are, according 
to CLEF participants, much less reliable than the original English stemmer. 

— The best monolingual results for the other languages in the monolingual
task, Italian and French, were obtained by two groups that concentrated on
monolingual retrieval (IRST and West Group) and applied extensive lexical
knowledge: lexical analysis and part-of-speech tagging in the case of IRST,
and lexicon-based stemming in the case of West Group.

— Automatic stemming learned from corpora and association dictionaries appears
to be a promising alternative to stemmers à la Porter. Three groups
(Chicago, UC Berkeley and Maryland) tested such techniques in CLEF 2000.

— MT systems are the only language resources that are not mainly developed 
by the same groups that participate in the CLEF evaluation. All the MT 
systems used are commercial systems: the free, online version of Systran 
software (babelfish), a Systran MT package and a Lernout & Hauspie version 
of the Power Translator. 

3 Language Resources in CLIR Evaluation 

Systems competing in CLEF and TREC multilingual tracks usually make two
kinds of contributions: the creation/adaptation/combination of language
resources, on the one hand, and the development of retrieval strategies making
use of such resources, on the other. A problem with CLEF tasks is that they
are designed to measure overall system performance. While the results indicate
promising research directions, it is harder to discern which language resources
worked better (because they were tested with different retrieval strategies),
and it is also unclear which retrieval strategies were best (as they were
tested using different language resources). Of course, the main evaluation
task should always be an overall task, because a good resource together with a
good retrieval strategy will not guarantee a good overall system (for
instance, the resource may not be compatible with the kind of information
demanded by the retrieval algorithm). But CLEF could perhaps benefit from
additional tracks measuring resources and retrieval strategies in isolation.
In the rest of this section, we list some possibilities:



3.1 Task with a Fixed Monolingual IR System 

A frequent approach to CLIR by CLEF participants is to translate the queries 
and/or documents and then perform a monolingual search with an IR system 
of their own. A wide range of IR systems are used in CLEF, from vector model 
systems to n-gram language models and database systems. This produces a dif- 
ferent monolingual retrieval baseline for each individual group, making it hard 
to compare the cross-language components of each system. 

A possible complementary task would be to ask participants to generate 
queries and/or document translations, and then feed a standard system (e.g. the 
Zprise system provided on demand by NIST to participants) with monolingual 
runs. A substantial number of participants would probably be able to provide 
such translations, and the results would shed some additional light on CLEF 
results with respect to the translation components used. 



3.2 Task with Fixed Resources 

A track in which all participants use the same set of language resources, provided 
by the CLEF organization, would make it possible to compare retrieval algo- 
rithms that participate in the main tracks with different resources. Ideally, CLEF 
could cooperate with the European Language Resources Association (ELRA) to 
provide a standard set of resources covering (at least) the languages included in 
the multilingual track. We see some obvious benefits: 

— Such standard resources would enormously facilitate the participation in the 
multilingual track for groups that need to scale up from systems working on 
a specific pair of languages. 

— A track of this type would highlight promising retrieval strategies that are 
ranked low simply because they are tested with poor resources. 

What kind of resources should be made available? There is no obvious answer, 
in our opinion, to this question. Fixing a particular type of language resource 
will restrict the potential number of participating systems, while providing all 
kinds of resources will again make the comparison of results problematic. 

From the experience of CLEF 2000, it seems reasonable to start with a
multilingual dictionary covering all languages in the multilingual track, or a
set of bilingual dictionaries/translation lists providing similar
functionality. In its catalogue, ELRA offers at least two resources that would
fit the requirements for the CLEF 2001 multilingual track (which will include
English, Spanish, German, Italian and French). One is a basic multilingual
lexicon with 30000 entries per language, covering the five languages in the
multilingual track [2]. This dictionary has already been evaluated for CLIR
purposes. The other one is the EuroWordNet lexical database, which offers
interconnected wordnets for 8 European languages in a size range (for the five
languages in the multilingual task) between 20000 word meanings for German and
168000 for English. EuroWordNet was used in CLEF 2000 by the Sheffield group
[15].



3.3 Task with a Large Set of Queries 

In a real-world application, the coverage of query terms by the language
resources is essential for the response of a system. Coverage, however, is
poorly measured in CLEF for the majority of systems that do query
translation: the whole set of queries (summing up title, description and
narrative) contains only a couple of thousand term occurrences (including stop
words), and the results are quite sensitive to the ability to provide
translations for a few critical terms. In addition, many relevant problems in
cross-language retrieval systems are underrepresented in the current queries.

As an example, let us consider a system that makes a special effort to provide
adequate translations for proper nouns. This tends to be a critical issue in
the newspaper domain, where a high percentage of queries include, or even
consist of, this type of term. Figure 1 gives a snapshot of queries to the EFE
newswire database that reflects the importance of proper nouns*. However, the
set of 40 queries in CLEF 2000 contains only three names of people (“Pierre
Beregovoy”, “Luc Jouret” and “Joseph di Mambro”) with a total of five
occurrences: fewer than 0.1 distinct names, and about 0.13 occurrences, per
query.



Jul 26 08:33:49 2000; (Joaquin garrigues walker)
Jul 26 08:34:34 2000; (descenso and moritz)
Jul 26 08:34:52 2000; (convencion republicana)
Jul 26 08:38:32 2000; (baloncesto real-madrid)
Jul 26 08:38:37 2000; (caricom)
Jul 26 08:38:41 2000; SHA REZA PAHLEVI
Jul 26 08:38:43 2000; SHA REZA PAHLEVI
Jul 26 08:38:45 2000; SHA REZA PAHLEVI
Jul 26 08:38:54 2000; (noticias internacional )
Jul 26 08:40:18 2000; (CONCORDE)
Jul 26 08:40:34 2000; (DOC) AND (CONCORDE)
Jul 26 08:42:31 2000; (MANUEL FERNANDEZ ALVAREZ)



Fig. 1. A 9-minute snapshot of the EFE news archive search service



* EFE is the provider of Spanish data for the CLEF 2001 campaign

Another example is a system that tries to find variants and translations for
named entities in general. In the CLEF 2000 queries, there are approximately 31
terms (excluding geographical names) that can be associated with named
entities, such as “Electroweak Theory” or “Deutsche Bundesbahn”. This
represents only around 0.1% of the total number of terms.

A final example is the ability of the resources to translate certain
acronyms, such as “GATT”. There are 5 acronyms in the collection (excluding
country names); their coverage may affect the final results, but this
variation will not be representative of how well the resources used cover
acronym translation.

It is impractical to think of a substantially larger set of queries for CLEF that 
is representative of every possible query style or cross-language issue. However, 
a practical compromise would be to use a multilingual aligned corpus (such as
the UN corpus) with documents containing a summary or a descriptive title. The 
titles or the summaries could be used as queries to retrieve the corresponding 
document in a known-item retrieval task. Obviously, such a task is no closer 
to real world IR than CLEF or TREC ad-hoc queries, but it would produce 
useful complementary information on the performance consistency of systems 
on a large query vocabulary, and would probably leave room to test particular 
issues such as proper noun translation or recognition of named entities. 
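
A known-item task of this kind is commonly scored with the mean reciprocal rank of the target document. The sketch below shows that measure under stated assumptions: the function names, the document identifiers and the toy runs are hypothetical, and the source does not prescribe a particular metric.

```python
def reciprocal_rank(ranked_doc_ids, target_id):
    """Return 1/rank of the known item, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranking, target) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)


# Each query is a title or summary; the target is the document it came from.
# Identifiers below are invented for illustration only.
runs = [
    (["d3", "d1", "d7"], "d1"),  # target found at rank 2
    (["d2", "d5"], "d2"),        # target found at rank 1
    (["d9", "d4"], "d8"),        # target not retrieved
]
mrr = mean_reciprocal_rank(runs)
```

Because every query has exactly one correct answer, no manual relevance assessment is needed, which is what would make a very large query set affordable.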



4 Conclusions 

The systems participating in CLEF 2000 provide a representative snapshot of
language resources for CLIR tasks. From the reported use of such resources in 
CLEF, together with the results of a survey conducted on the participant groups, 
some interesting conclusions can be drawn: 

— There is a wide variety (in type, coverage and quality) of resources used in 
CLIR systems, but little reuse or resource sharing. CLEF campaigns could
play a key role in improving availability, dissemination and sharing of
resources. 

— Corpus-based approaches, which were less popular due to the lack of parallel 
corpora, are successfully employing web-mined parallel corpora and compa- 
rable corpora. 

— The distinction between corpus-based and dictionary-based approaches is be- 
coming less useful to classify CLIR systems, as they tend to merge whatever 
resources are available. 

— Richer lexical analysis seems to lead to better monolingual results in lan- 
guages other than English, although the difference is only significant for 
German, where decompounding is essential. 

— System builders devote a significant part of their efforts to resource
building. Indirectly, CLEF campaigns are also evaluating such resources. We
have proposed three complementary tasks that would reflect the
systems/resources duality in CLIR better than a single, overall retrieval
task: a) a task with a fixed monolingual IR system, fed with query
translations provided by participants; b) a task with fixed resources provided
by CLEF; c) a task with a large set of queries to provide a significant number
of cases for relevant CLIR problems (e.g. proper nouns or vocabulary
coverage).

Abstract. This paper describes the domain-specific cross-language
information retrieval (CLIR) task of CLEF, why and how it is important,
and how it differs from the general cross-language retrieval problem
associated with the general CLEF collections. The inclusion of a
domain-specific document collection and topics has both advantages and
disadvantages.



1 Introduction 

For the past decade, the trend in information retrieval test-collection
development and evaluation has been toward general, domain-independent text
such as newswire information. This trend has been fostered by the needs of
intelligence agencies and the non-specific nature of the World Wide Web and its
indexing challenges. The documents in these collections (and in the general
CLEF collections) contain information of a non-specific nature and therefore
could potentially be judged by anyone with good general knowledge.

Critics of this strategy believe that the tests are not sufficient to solve the 
problems of more domain-oriented data collections and topics. Particularly for 
cross-language information retrieval, we may have a vocabulary disconnect prob- 
lem since the vocabulary for a specific area may not exist in a Machine Transla- 
tion (MT) system used to translate queries or documents. Indeed, the vocabulary 
may have been redefined in a specific domain to mean something quite differ- 
ent from its general meaning. The rationale of the inclusion of domain specific 
collections into the tests is to test retrieval systems on another type of docu- 
ment collection, serving a different kind of information need. The information 
provided by these domain specific documents is far more targeted than news 
stories. Moreover, the documents contain quite specific terminology related to 
the respective domain. The hypothesis to be tested is whether domain-specific 
enhancements to information retrieval provide (statistically significant) improve- 
ment in performance over general information retrieval approaches. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 48-IHTl 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 

Information retrieval has a rich history of test collections, beginning with
Cranfield, which arose out of the desire to improve and enhance search of
scientific and technical literature. The GIRT collections (defined below) of
the TREC-8 evaluation and of this first CLEF campaign provide an opportunity
for IR to return to its roots and to illuminate those particular research
problems and specific approaches associated with domain-specific retrieval.
Other recent examples of domain-specific collections are the OHSUMED
collection [10] for the medical domain and the NTCIR collection for science
and engineering. The OHSUMED collection has been explored for its potential in
query expansion and was utilized in the filtering track of the TREC-9
conference (see http://trec.nist.gov). The NTCIR collection is the first major
test collection in Japanese, and the NTCIR evaluations have provided the first
large-scale test of Japanese-English cross-language information retrieval.



2 Advantages and Disadvantages of Domain Specific 
CLIR 

A domain-specific language requires appropriate indexing and retrieval systems.
Recent results clearly show the difficulty of differentiating between
domain-specific (in this case: sociological) terms and common language terms:
“words [used in sociology] are common words that are [also] in general use,
such as community or immigrant”. In many cases there is a clear difference
between the scientific meaning and the common meaning. Furthermore, there are
often considerable differences between scientific terms when used in different
domains, owing to different connotations, theories, political implications,
ethical convictions, and so on. This means that it can be more difficult to
use automatically generated terms and queries for retrieval. For example,
Ballesteros and Croft have noted, for a dictionary-based cross-language query
system: “queries containing domain-specific terminology which is not found in
general dictionaries were shown to suffer an additional loss in performance”.
In some disciplines (for instance in biology) different terminologies have
evolved in quite narrow subfields, as Chen et al. have shown for research
dealing with the species of worms and flies and their diverging terminology.

For several domains Haas has carried out in-depth research and stated:
“T tests between discipline pairs showed that physics, electrical engineering, and 
biology had significantly more domain terms in sequences than history, psychol- 
ogy, and sociology (...) the domains with more term sequences are those which 
may be considered the hard sciences, while those with more isolated domain 
terms tend to be the social sciences and humanities.” 

Nevertheless, domain specific test collections offer new possibilities for the 
testing of retrieval systems as they allow the domain specific adjustment of 
the system design and the test of general solutions for specific areas of usage. 
Developers of domain specific CLIR systems need to be able to tune their systems 
to meet the specific needs of a more targeted user group. 

The users of domain-specific collections are typically interested in the
completeness of coverage. They may not be satisfied with finding just some
relevant documents from a collection. For these users, too much overlap
between the relevant documents in the result sets of the different evaluated
systems is a much more important problem, and one that has to be solved.

3 Domain Specific Evaluation Procedures 

Domain-specificity has consequences not only for the data but also for the topic 
creation and assessment processes. Separate specific topics have to be created 
because the data are very different from those found in newspapers or newswires.
The GIRT documents treat more long-term societal or scientific problems in an 
in-depth manner; current problems or popular events (as they are represented in 
news articles) are dealt with after some time lag. Nevertheless, the TREC/CLEF 
domain-specific task attempted to cover German newswire and newspaper arti- 
cles as well as the GIRT collection. Thus topics were developed which combined 
both general and domain specific characteristics. It proved to be challenging to 
discover topics which would retrieve news stories as well as scientific articles. 

The topic developers must be familiar with the specific domain as well as 
the respective language in which the topic has been created or into which the 
topic is to be translated. The same is true for the assessors - they must have do- 
main related qualifications and sufficient language skills to develop the relevance 
judgements. 

Therefore each domain specific sub-task needs its own group of topic de- 
velopers and relevance assessors in all languages used for the sub-task. Finally 
the systems being tested must be able to adjust general principles for retrieval 
systems to the domain-specific area. 

4 The GIRT Domain-Specific Social Science Test 
Collection 

The TREC-7, TREC-8 and CLEF 2000 evaluations have offered a domain-specific
subtask and collection for CLIR in addition to the generally used collections.
The test collection for this domain-specific subtask is called GIRT (German
Information Retrieval Test database) and comes from the social sciences. It
has been used in several German tests of retrieval systems. The GIRT
collection was made available for research purposes by the InformationsZentrum
Sozialwissenschaften (IZ; German Social Sciences Information Centre), Bonn.
For pre-test research by the IZ and the University of Konstanz, a first
version, the GIRT1 collection, contained about 13,000 documents. For the
TREC-7 and TREC-8 evaluations, the GIRT2 collection was offered, which
included GIRT1 supplemented with additional documents and contained about
38,000 documents. In the CLEF 2000 campaign the GIRT3 collection was used,
which included the GIRT2 data and additional sampled documents for a total of
about 76,000 documents. Figure 1 presents a sample document from the GIRT3
collection.



<DOCNO>1994Q100925</DOCNO>
<TITLE>Psychisch kranke Mitarbeiter in Betrieben : die Sichtweise der betrieblichen
Helfer</TITLE>
<TITLE-ENG>Mentally ill employees in companies : the viewpoint of company assistants</TITLE-ENG>
<AUTHOR>Schubert, Andreas</AUTHOR>
<PUBLICATION-YEAR>1988</PUBLICATION-YEAR>
<LANGUAGE>DE</LANGUAGE>
<CONTROLLED-TERM>psychische Krankheit, Mitarbeiter, Betrieb, Helfer, soziales
Netzwerk,Bezugsperson,Integration</CONTROLLED-TERM>
<CLASSIFICATION>Industriesoziologie, Betriebssoziologie, Arbeitssoziologie, industrielle
Beziehungen,soziale Probleme,Sozialpolitik</CLASSIFICATION>
<TEXT>"Ausgehend von der äußerst problematischen Situation psychisch kranker und
behinderter Menschen auf dem allgemeinen Arbeitsmarkt wird die besondere Bedeutung
innerbetrieblicher Hilfen dargestellt. Dazu wird modellhaft die Situation eines Mitarbeiters mit
'seelischen Problemen' in einem Betrieb skizziert, um somit die potentiellen Bezugspersonen
und damit ein mögliches innerbetriebliches soziales Netzwerk zu kennzeichnen. Die
Fragestellung der dargestellten Untersuchung ist, inwieweit die per Gesetz zur Unterstützung
Behinderter und damit auch psychisch behinderter Mitarbeiter verpflichteten 'betrieblicher
Helfer', diese Funktion tatsächlich wahrnehmen, d.h. inwieweit das Hilfspotential dieser
Gruppe sich umsetzt in ein für den Betroffenen erfahrbares innerbetriebliches soziales
Netzwerk. Dazu werden die Ergebnisse einer schriftlichen Befragung von 144 betrieblichen
Helfern referiert. Als Fazit der Untersuchung muß von einem relativ geringen Kenntnisstand
betrieblicher Helfer bzgl. der Auswirkungen psychischer Krankheit ausgegangen werden, von
negativen Einschätzungen der Leistungs- und Integrationsmöglichkeiten psychisch
behinderter Mitarbeiter und von einer starken Tendenz dieser Gruppe, die Problematik und
damit die Betroffenen auszugrenzen oder, bei betriebsinternen Vorfällen, an betriebliche
Entscheidungsträger wie direkte Vorgesetzte, Personal- und Betriebsleitung 'abzuschieben'.
Da häufig weder interne noch externe Fachleute hinzugezogen werden, ist der Aufbau eines
innerbetrieblichen Netzwerkes als sehr schwierig einzuschätzen. Positive Beispiele belegen
allerdings die Integrationsmöglichkeiten für psychisch Behinderte auch in 'normalen'
Betrieben." (Autorenreferat)</TEXT>
<TEXT-ENG>"Because of the extremly problematical situation of psychologically disturbed
people so far as the job market is concerned this paper stresses the importance of help inside
the concerns. In order to show potential sources of help and thus a possible supportive
network inside a firm a model case of a worker with 'psychological problems' is sketched.
This investigation was aimed at discovering how far the legal obligation to assist handicapped
people inside industrial concerns, and thus also psychologically handicapped workers, is
actually fulfilled by the 'industrial helpers', i.e. how far the potential help offered by these



Fig. 1. GIRT Sample document (English text truncated) 
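
Records in this tagged layout can be split into fields with a small regular expression. The sketch below is a minimal illustration under stated assumptions: it uses an abridged copy of the sample record, treats the comma-separated CONTROLLED-TERM field as a list, and silently skips any field whose closing tag is missing; the function name is hypothetical and this is not the tooling used by the GIRT providers.

```python
import re

# Matches <TAG>body</TAG> where the closing tag repeats the opening one.
FIELD_RE = re.compile(r"<(?P<tag>[A-Z-]+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)


def parse_girt_record(record):
    """Return a dict of field name -> text (or list of controlled terms)."""
    fields = {}
    for match in FIELD_RE.finditer(record):
        # Collapse the line breaks introduced by hard wrapping.
        fields[match.group("tag")] = " ".join(match.group("body").split())
    if "CONTROLLED-TERM" in fields:
        fields["CONTROLLED-TERM"] = [
            term.strip()
            for term in fields["CONTROLLED-TERM"].split(",")
            if term.strip()
        ]
    return fields


# Abridged version of the sample document, for illustration only.
sample = """<DOCNO>1994Q100925</DOCNO>
<TITLE>Psychisch kranke Mitarbeiter in Betrieben : die Sichtweise der betrieblichen
Helfer</TITLE>
<LANGUAGE>DE</LANGUAGE>
<CONTROLLED-TERM>psychische Krankheit, Mitarbeiter, Betrieb, Helfer, soziales
Netzwerk,Bezugsperson,Integration</CONTROLLED-TERM>
<TEXT-ENG>truncated abstract without a closing tag"""

record = parse_girt_record(sample)
```

A regular expression is used here rather than an XML parser because a truncated record, like the one in the figure, is not well-formed.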



The GIRT data have been collected from two German databases offered
commercially by the IZ via traditional information providers (STN
International, GBI, DIMDI) and on CD-ROM (WISO III): FORIS (descriptions of
current social science research projects in the German-speaking countries),
and SOLIS (references to social science literature originating in
German-speaking countries, comprising journal articles, monographs, articles
in collections, scientific reports and dissertations). The FORIS database
contains about 35,000 documents on current and finished research projects of
the last ten years. As projects are living objects, the documents are often
changed; thus, about 6,000 documents are changed or newly entered each year.
SOLIS contains more than 250,000 documents, with a yearly addition of about
10,000 documents.

The GIRT3 data contain selected bibliographical information (author, language
of the document, publication year), as well as additional information elements
describing the content of the documents: controlled indexing terms, free
terms, classification texts, and abstracts (TEXT), all in German (GIRT1 and
GIRT2 data contained some other fields). Besides the German information,
English translations of the titles are available for 71% of the documents. For
some documents (about 8%) there are also English translations of the abstracts
(TEXT-ENG). One exception is the TITLE field, where the original title of the
document is stored: in some cases the original title was already in English,
so no English translation was necessary and the field TITLE-ENG is missing,
although the title is in fact English. The information elements of the GIRT
collection are quite similar to those of the OHSUMED collection, which was
developed by William Hersh for the medical domain, but that test collection is
bigger (348,566 documents). The OHSUMED fields are: title, abstract, controlled
indexing term (MeSH), author, source, publication type.

Most of the GIRT3 documents have German abstracts (96% of the documents),
some have English abstracts (8%). For the 76,128 documents 755,333
controlled terms have been assigned, meaning, on average, each document has 
nearly 10 indexing terms. Some documents (nearly 9%) have free terms assigned 
which are only given by the indexing staff of the IZ to make proposals for new 
terms to be included in the thesaurus. The documents have on average two 
classifications assigned to each of them. The indexing rules allow assignment of 
one main classification, as well as one or more additional classifications if other 
(sub-)areas are treated in the document. The average number of authors for each 
document is nearly two. The average document size of the GIRT documents is 
about 2 KB. 



Field label        # Occurrences  Percent in   Avg. # of entries
                   of field       GIRT3 docs   per doc
DOC                    76,128       100.00        1.00
DOCNO                  76,128       100.00        1.00
LANGUAGE               76,128       100.00        1.00
PUBLICATION YEAR       76,128       100.00        1.00
TITLE                  76,128       100.00        1.00
TITLE-ENG              54,275        71.29        -
TEXT                   73,291        96.27        -
TEXT-ENG                6,063         7.96        -
CONTROLLED-TERM       755,333        -            9.92
FREE-TERM               6,588        -            0.09
CLASSIFICATION        169,064        -            2.22
AUTHOR                126,322        -            1.66



Table 1. Statistics of the GIRT3 data collection 
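
The percentage and average columns of Table 1 follow directly from the occurrence counts and the collection size of 76,128 documents. As a quick consistency check, the short sketch below recomputes them; the counts are copied from the table, while the dictionary and function names are illustrative only.

```python
TOTAL_DOCS = 76_128

# Occurrence counts copied from Table 1.
OCCURRENCES = {
    "TITLE-ENG": 54_275,
    "TEXT": 73_291,
    "TEXT-ENG": 6_063,
    "CONTROLLED-TERM": 755_333,
    "FREE-TERM": 6_588,
    "CLASSIFICATION": 169_064,
    "AUTHOR": 126_322,
}


def percent_of_docs(field):
    """Share of documents carrying the field, for fields occurring at most once."""
    return round(100 * OCCURRENCES[field] / TOTAL_DOCS, 2)


def average_per_doc(field):
    """Mean number of entries per document, for repeatable fields."""
    return round(OCCURRENCES[field] / TOTAL_DOCS, 2)


title_eng_pct = percent_of_docs("TITLE-ENG")
controlled_avg = average_per_doc("CONTROLLED-TERM")
```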



The GIRT multilingual thesaurus (German-English), based on the Thesaurus
for the Social Sciences, provides the vocabulary source for the indexing terms
within CLEF (see Figure 2). A Russian translation of the German thesaurus is
also available. The German-English thesaurus has about 10,800 entries, of which
7,150 are descriptors and 3,650 non-descriptors. For each German descriptor
there is an English or Russian equivalent. The German non-descriptors have
been translated into English in nearly every case, but this is not true for the
Russian word list. There are minor differences between the thesaurus and the
trilingual German-English-Russian word list, because the word list was
completed earlier (1996) than the latest version of the Thesaurus (1999).
Thus, English or Russian indexing terms could be used for retrieval purposes
by matching them to the equivalent German terms from the respective version of
the thesaurus.



<entry>
<german>Absatzpolitik</german>
<related-concept>ABSATZPOLITIK</related-concept>
<broader-term>Unternehmenspolitik</broader-term>
<narrower-term>Werbung</narrower-term>
<narrower-term>Produktgestaltung</narrower-term>
<narrower-term>Preispolitik</narrower-term>
<english>sales policy</english>
</entry>



Fig. 2. GIRT Thesaurus Entry 
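
Matching English query terms to their German equivalents, as described above, amounts to building a lookup table from entries of this form. The sketch below does so with Python's standard XML parser. It is an illustration under stated assumptions: the wrapping `<thesaurus>` root element and the function name are hypothetical, and the actual distribution format of the thesaurus file may differ.

```python
import xml.etree.ElementTree as ET

# The entry shown in Fig. 2.
ENTRY = """<entry>
<german>Absatzpolitik</german>
<related-concept>ABSATZPOLITIK</related-concept>
<broader-term>Unternehmenspolitik</broader-term>
<narrower-term>Werbung</narrower-term>
<narrower-term>Produktgestaltung</narrower-term>
<narrower-term>Preispolitik</narrower-term>
<english>sales policy</english>
</entry>"""


def build_lookup(entries_xml):
    """Map each English term to its German descriptor and narrower terms."""
    # Wrap the entries in a single root so the text parses as one document.
    root = ET.fromstring(f"<thesaurus>{entries_xml}</thesaurus>")
    lookup = {}
    for entry in root.iter("entry"):
        german = entry.findtext("german")
        english = entry.findtext("english")
        if german and english:
            narrower = [n.text for n in entry.findall("narrower-term")]
            lookup[english.strip().lower()] = (german.strip(), narrower)
    return lookup


lookup = build_lookup(ENTRY)
german_term, narrower_terms = lookup["sales policy"]
```

The narrower terms are returned alongside the descriptor because they are an obvious source of query expansion terms once the German descriptor has been found.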



The first GIRT collection (GIRT1), which was utilized for the pre-tests,
contained a subset of the databases FORIS and SOLIS with about 13,000
documents, which were restricted to the publication years 1987-1996 and to the
topical areas of “sociology of work”, “women studies” and “migration and
ethnical minorities” (with some additional articles without topical
restrictions from two leading German sociology journals published in this
time span). This topical restriction was obtained by choosing the appropriate
classification codes as search criteria. The GIRT2 collection, offered in
TREC-7 and TREC-8, contained a subset of the databases FORIS and SOLIS, which
included the GIRT1 data and followed the same topical restrictions, but was
enlarged to the publication years 1978-1996. This led to a specific topicality
of the data, which had to be considered during the topic development process
and restricted the possibilities of selecting topics. The distribution of
descriptors and even of the words within the documents was also affected by
these topical restrictions. The GIRT3 collection, offered in the CLEF 2000
campaign, has been broadened to all documents in this time span regardless of
their topics. Thus, this collection is an unbiased representative sample of
documents in German social sciences between 1978 and 1996.



5 Experiences and Opportunities in TREC/CLEF with
Domain Specific CLIR 

Although specific terminology and vocabularies must be changed for each new 
domain, this is more than compensated for by features which can be exploited 
in domain-specific cross-language information retrieval. Existing domain-related 
- <top>

<num>girt002</num> 

<E-title>Kids and Computer Games</E-title> 

<E-desc>How are computer games used by children?</E-desc> 

<E-narr>Find information on how children use computer games and on the consequences of such
use.</E-narr>

</top> 

- <top> 

<num>girt002</num> 

<G-title>Kinder und Computerspiele</G-title> 

<G-desc>Was gibt es über die Nutzung von Computerspielen durch Kinder?</G-desc>
<G-narr>Alle Informationen über die Benutzung und Auswirkung der Nutzung von
Computerspielen durch Kinder sind von Interesse. Ebenso sind Untersuchungen über die
Gründe von Gewalt sowie Programme und Maßnahmen gegen Gewalt relevant.</G-narr>

</top> 



Fig. 3. GIRT Topic 002 - Children and computer games 



vocabularies or thesauri can be utilized to reduce ambiguity of search and in- 
crease precision of the results. For multilingual thesauri an additional benefit 
accrues from using them as translation tools because the related term pairs of 
languages are available. Use of the MESH multilingual thesaurus for CLIR was
explored by Eichmann, Ruiz and Srinivasan [5] for the OHSUMED collection.

Additional aids are given if there exist translated parts of the documents (of- 
ten the case for scientific literature, where English titles are frequently available 
for documents in other languages). This can allow a direct search against the 
translated document parts. The same advantage arises within existing document 
structures where the use of the specific meaning of different information elements 
allows a targeted search (e.g., if an author field exists, it is possible to distinguish
between a person as the subject of an article and as its author).

Thus far the GIRT collections have received limited attention from groups engaged
in cross-language information retrieval. At TREC-8 two groups participated;
at CLEF three groups participated, one of which submitted
only a monolingual entry. The best monolingual entry was submitted by the
Xerox European Research Centre, while the cross-language entries came from
the Berkeley group [7] and the Dortmund group [8].



6 Conclusion 

This paper has discussed the domain-specific retrieval task at CLEF. The GIRT
collection, oriented toward the social science domain, offers new opportunities 
in exploring cross-language information retrieval for specialized domains. The 
specific enhancements available with the GIRT collection are: 

— a collection indexed manually to a controlled vocabulary 

— bi-lingual titles (German and English) for almost all documents 

— a hierarchical thesaurus of the controlled vocabulary 

— multilingual translations of the thesaurus (German, English, Russian) 
The multilingual thesaurus can be utilized as a vocabulary source for query
translation and as a starting point for query expansion to enhance cross-language
retrieval. Because each document is manually assigned, on average, ten controlled
vocabulary terms, the collection also offers the opportunity for research
into multi-class text categorization.



References 

[1] Lisa Ballesteros and W. Bruce Croft. Statistical methods for cross-language information
retrieval. In Gregory Grefenstette, editor, Cross Language Information
Retrieval, pages 21-40. Kluwer, 1998.

[2] Gisbert Binder, Matthias Stahl, and Lothar Faulborn. Vergleichsuntersuchung 
MESSENGER - FULCRUM projektbericht available at http://www.bonn.iz- 
soz.de/publications/series/working-papers/abl8.pdf. In IZ-Arbeitsbericht Nr. 18, 
Bonn, 2000. 

[3] Hsinchun Chen, Joanne Martinez, Tobun Ng, and Bruce Schatz. A concept space
approach to addressing the vocabulary problem in scientific information retrieval:
An experiment on the worm community system. In Journal of the American
Society for Information Science, volume 48 (1), pages 17-31, 1997.

[4] Hannelore Schott (ed.). Thesaurus for the Social Sciences. [Vol. 1:] German- 
English. [Vol. 2:] English- German. [Edition] 1999. InformationsZentrum Sozial- 
wissenschaften Bonn, 2000. 

[5] David Eichmann, Miguel Ruiz, and Padmini Srinivasan. Cross-language information
retrieval with the UMLS metathesaurus. In W. B. Croft, A. Moffat, C. J. van
Rijsbergen, R. Wilkinson, and J. Zobel, editors, Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval, Melbourne, Australia, pages 72-80, August 1998.

[6] Elisabeth Frisch and Michael Kluck. Pretest zum projekt german indexing and 
retrieval testdatabase (GIRT) unter anwendung der retrievalsysteme messenger 
und freewaissf. In IZ-Arbeitsbericht Nr. 10, Bonn, 1997. 

[7] Fredric Gey, Hailing Jiang, Vivien Petras, and Aitao Chen. Cross-language retrieval
for the CLEF collections - comparing multiple methods of retrieval. In this
volume.

[8] Norbert Gövert. Bilingual information retrieval with HyREX and internet translation
services. In this volume.

[9] Stephanie W. Haas. Disciplinary variation in automatic sublanguage term identification.
In Journal of American Society for Information Science, volume 48,
pages 67-79, 1997.

[10] William Hersh, Chris Buckley, TJ Leone, and David Hickman. OHSUMED: An
interactive retrieval evaluation and new large test collection for research. In
W. Bruce Croft and C.J. van Rijsbergen, editors, Proceedings of SIGIR94, the
Seventeenth Annual International ACM-SIGIR Conference on Research and Development
in Information Retrieval, pages 192-201, 1994.

[11] William Hersh, Susan Price, and Larry Donohoe. Assessing thesaurus-based query
expansion using the UMLS thesaurus. In Proceedings of the 2000 Annual AMIA
Fall Symposium, pages 344-348, 2000.

[12] Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, 
and Souichiro Hidaka. Overview of IR tasks at the first NTCIR workshop. In 
Noriko Kando and Toshihiko Nozue, editors. The First NTCIR Workshop on 




Japanese Text Retrieval and Term Recognition, Tokyo, Japan, pages 11-22,
September 1999.

[13] Padmini Srinivasan. Query expansion and MEDLINE. In Information Processing 
and Management, volume 32(4), pages 431-443, 1996. 

[14] Christa Womser-Hacker (ed.) et al. Projektkurs Informationsmanagement:
Durchführung einer Evaluierungsstudie, Vergleich der Information-Retrieval-Systeme
(IRS) DOMESTIC, LARS II, TextExtender. University of Konstanz,
1998.




Evaluating Interactive 
Cross-Language Information Retrieval: 
Document Selection 



Douglas W. Oard 

Human Computer Interaction Laboratory 
College of Information Studies and 
Institute for Advanced Computer Studies 
University of Maryland, College Park, MD 20742, USA 
oard@glue.umd.edu,
http://www.glue.umd.edu/~oard/



Abstract. The problem of finding documents that are written in a lan- 
guage that the searcher cannot read is perhaps the most challenging 
application of Cross-Language Information Retrieval (CLIR) technol- 
ogy. The first Cross-Language Evaluation Forum (CLEF) provided an 
excellent venue for assessing the performance of automated CLIR tech- 
niques, but little is known about how searchers and systems might in- 
teract to achieve better cross-language search results than automated 
systems alone can provide. This paper explores the question of how in- 
teractive approaches to CLIR might be evaluated, suggesting an initial 
focus on evaluation of interactive document selection. Important evaluation
issues are identified, the structure of an interactive CLEF evaluation
is proposed, and the key research communities that could be brought to- 
gether by such an evaluation are introduced. 



1 Introduction 

Cross-language information retrieval (CLIR) has somewhat uncharitably been 
referred to as “the problem of finding people documents that they cannot read.” 
Of course, this is not strictly true. For example, multilingual searchers might 
want to issue a single query to a multilingual collection, or searchers with a lim- 
ited active vocabulary (but good reading comprehension) in a second language 
might prefer to issue queries in their most fluent language. In this paper, how- 
ever, we focus on the most challenging case — when the searcher cannot read the 
document language at all. 

Before focusing on evaluation, it might be useful to say a few words about 
why anyone might want to find a document that they cannot read. The most 
straightforward answer, and the one that we will focus on here, is that after 
finding the document they could somehow obtain a translation that is adequate 
to support their intended use of the document (e.g., learning from it, summa- 
rizing it, or quoting from it). CLIR and translation clearly have a symbiotic 
relationship: translation makes CLIR more useful, and CLIR makes translation
more useful (if you never find a document that you cannot read, why would you 
need translation?). 

In the research literature, it has become common to implicitly treat CLIR as 
a task to be accomplished by a machine. Information retrieval is a challenging 
problem, however, and many applications require better performance than ma- 
chines alone can provide. In such cases, the only practical approach is to develop 
systems in which humans and machines interact to achieve better results than a 
machine can produce alone. A simple example from monolingual retrieval serves 
to illustrate this point. The top-ranked two documents that result from a Google 
search for “interactive CLIR” are about interactive products developed by the 
Council on Library and Information Resources. But an interactive searcher can 
easily recognize from the brief summaries that the next few documents in the 
ranked list use the search terms in the same manner as this paper. In this case, a 
system that might be judged a failure if used in a fully automatic (top-document) 
mode actually turns out to be quite useful when used as the automatic portion 
of a human-machine system. 

The process by which searchers interact with information systems to find 
documents has been extensively studied (for an excellent overview, see [3]).
Essentially, there are two key points at which the searcher and the system interact:
query formulation and document selection. Query formulation is a complex cog- 
nitive process in which searchers apply three kinds of knowledge — what they 
think they want, what they think the information system can do, and what they 
think the document collection being searched contains — to develop a query. The 
query formulation process is typically iterative, with searchers learning about the 
collection and the system, and often about what it is that they really wanted 
to know, by posing queries and examining retrieval results. Ultimately we must 
study the query formulation process in a cross-language retrieval environment 
if we are to design systems that effectively support real information seeking 
behaviors. But the Cross-Language Evaluation Forum (CLEF) is probably not 
the right venue for such a study, in part because the open-ended nature of the 
query formulation process might make it difficult to agree on a sharp focus for 
quantitative evaluation in the near term. 

Evaluation of cross-language document selection seems like a more straight- 
forward initial step. Interactive document selection is essentially a manual detec- 
tion problem — given the documents that are nominated by the system as being 
of possible interest, the searcher must recognize which documents are truly of 
interest. Modern information retrieval systems typically present a ranked list 
that contains summary information for each document (e.g., title, date, source 
and a brief extract) and typically also provide on-demand access to the full text 
of one document at a time. In the cross-language case, we assume that both the 
summary information and the full text are presented to the searcher in the form 
of automatically generated translations, a process typically referred to as “machine
translation.”¹ Evaluation of document selection seems to be well suited
to the CLEF framework because the “ground truth” needed for the evaluation
(identifying which documents should have been selected) can be determined using
the same pooled relevance assessment methodology that is used in the present
evaluation of fully automatic systems.

Focusing on interactive CLIR would not actually be as radical a departure
for CLEF as it might first appear. As Section 2 explains, the principal CLEF
evaluation measure, mean average precision, is actually designed to model the
automatic component of an interactive search process, at least when used in a
monolingual context. Section 3 extends that analysis to include the effect of
document selection, concluding that a focused investigation of the cross-language
document selection problem is warranted. Sections 4 and 5 then sketch out the
broad contours of what an interactive CLEF evaluation with such a focus might
look like. Finally, Section 6 addresses the question of whether the necessary
research base exists to justify evaluation of interactive CLIR by identifying some
key research communities that are well positioned to contribute to the development
of this technology.

2 Deconstructing Mean Average Precision 

Two types of measures are commonly used in evaluations of cross-language in- 
formation retrieval effectiveness: ranked retrieval measures and set-based re- 
trieval measures. In the translingual topic tracking task of the Topic Detection 
and Tracking evaluation, a set-based measure (detection error cost) is used.
But ranked retrieval measures are reported far more commonly, having been 
adopted for the cross-language retrieval tasks in CLEF, TREC and NTCIR. The 
trec_eval software used in all three evaluations produces several useful ranked 
retrieval measures, but comparisons between systems are most often based on 
the mean uninterpolated average precision (MAP) measure. MAP is defined as:

$$\mathrm{MAP} = E_i\left[ E_j\left[ \frac{j}{r(i,j)} \right] \right]$$

where $E_i[\cdot]$ is the sample expectation over a set of queries, $E_j[\cdot]$ is the sample
expectation over the documents that are relevant to query $i$, and $r(i,j)$ is the
rank of the $j$-th relevant document for query $i$.
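
The definition can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, each ranked list is assumed to contain no duplicates, and a relevant document that is never retrieved is assumed to contribute zero to the inner expectation.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one query.

    The j-th relevant document found at rank r(i, j) contributes j / r(i, j);
    relevant documents that are never retrieved contribute 0 (an assumption).
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a query needs at least one relevant document")
    found = 0      # j: number of relevant documents seen so far
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean over queries; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    values = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # relevant at ranks 1 and 3
        (["d7", "d5", "d9"], {"d5"}),              # relevant at rank 2
    ]
    print(mean_average_precision(runs))
```

For the first query the relevant documents sit at ranks 1 and 3, so the inner expectation is (1/1 + 2/3)/2; the outer expectation then averages that value with the 1/2 obtained for the second query.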

The MAP measure has a number of desirable characteristics. For example, 
improvement in precision at any value of recall or in recall at any value of 
precision will result in a corresponding improvement in MAP. Since MAP is 
so widely reported, it is worth taking a moment to consider what process the 
computation actually models. One way to think of MAP is as a measure of 
effectiveness for the one-pass interactive retrieval process shown in Figure 1 in
which: 

¹ Note that the subsequent translation step (translation to support the ultimate use
of the document) may or may not be accomplished using machine translation,
depending on the degree of fluency that is required.




Fig. 1. A one-pass monolingual search process. 



1. The searcher creates a query in a manner similar to those over which the 
outer expectation is computed. 

2. The system computes a ranked list in a way that seeks to place the topically 
relevant documents as close to the top of the list as is possible, given the 
available evidence (query terms, document terms, embedded knowledge of 
language characteristics such as stemming, ...).

3. The searcher starts at the top of the list and examines each document (and/or 
summaries of those documents) until they are satisfied. 

4. The searcher becomes satisfied after finding some number of relevant docu- 
ments, but we have no a priori knowledge of how many relevant documents 
it will take to satisfy the searcher. Note that here we implicitly assume that 
every document is either relevant or it is not (in other words, we don’t ac- 
count for differences in the perceived degree of relevance), and that relevance 
assessments are independent (i.e., having seen one document does not change 
the searcher’s opinion of the relevance of another relevant document). 

5. The searcher’s degree of satisfaction is related to the number of documents 
that they need to examine before finding the desired number of relevant 
documents. 

Although actual interactive search sessions often include activities such as learn- 
ing and iterative query reformulation that are not modeled by this simple process, 
it seems reasonable to expect that searchers would prefer systems which perform 
better by this measure over systems that don’t perform as well. 

3 Modeling the Cross-Language Retrieval Process 

One striking feature of the process described above is that we have implicitly 
assumed that the searcher is able to recognize relevant documents when they 
see them. Although there will undoubtedly be cases when a searcher either over- 
looks a relevant document or initially believes a document to be relevant but 
later decides otherwise, modeling the searcher as a perfect detector is not an un- 
reasonable assumption when the documents are written in a language that the 
searcher can read. If the documents are written in a language that the searcher 
cannot read, the final three steps above could be modified as illustrated in
Figure 2 to:







Fig. 2. A one-pass cross-language search process for searchers who cannot read 
French. 



3a. The searcher starts at the top of the list and examines an automatically 
produced translation of each document (or summary translations of 
those documents) until they are satisfied. 

4a. The searcher becomes satisfied after identifying a number of possibly relevant
documents that they believe is sufficient to assure that they have
found the desired number of relevant documents, but we have no a priori
knowledge of how many relevant documents it will take to satisfy the
searcher.²

5a. The searcher commissions fluent translations of the selected docu- 
ments, and the searcher’s degree of satisfaction is related to both the number 
of documents that they needed to examine and the fraction of the translated 
documents that actually turn out to be relevant.³

Of course, this is only one of many ways in which a cross-language retrieval 
system might be used.⁴ But it does seem to represent at least one way in which
a cross-language retrieval system might actually be employed, and it does so in 
a way that retains a clear relationship to the MAP measure that is already in 
widespread use. The actual outcome of the process depends on two factors: 

— The degree to which the automatically produced translations support the 
searcher’s task of recognizing possibly relevant documents. 

² To retain a comparable form for the formula, it is also necessary to assume that the
last document selected by the searcher actually happens to be relevant. 

³ This formulation does not explicitly recognize that the process may ultimately yield
far too many or far too few relevant documents. If too few result, the searcher can 
proceed further down the list, commissioning more translations. If too many result, 
the searcher can adopt a more conservative strategy next time. 

⁴ An alternative process would be to begin at the top of the list and commission a
fluent human translation of one document at a time, only proceeding to another 
document after examining the previous one. 
— The searcher’s propensity to select documents as being possibly relevant in
the presence of uncertainty. 



We model the combined effect of these factors using two parameters: 

$p_r$ The probability of correctly recognizing a relevant document.

$p_f$ The probability of a false alarm (i.e., commissioning a translation for a
document that turns out not to be relevant).



We can now propose a measure of effectiveness $C$ for interactive CLIR systems
in which the searcher cannot read the language of the retrieved documents:



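$$\begin{aligned}
C &= k \cdot E_i\left[ E_j\left[ \frac{p_r \cdot j}{r(i,j)} \right] \right] + (1-k) \cdot E_i\left[ E_j\left[ \frac{j + (1 - p_f)(r(i,j) - j)}{r(i,j)} \right] \right] \\
  &= k \cdot p_r \cdot \mathrm{MAP} + (1-k)\left(1 - p_f (1 - \mathrm{MAP})\right)
\end{aligned}$$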



where the free parameter $k \in [0,1]$ reflects the relative importance to the searcher
of limiting the number of examined documents (the first term) and of limiting
the translation of non-relevant documents (the second term).⁵ The first term
reflects a straightforward adjustment to the formula for mean average precision
to incorporate $p_r$. In the second term, success is achieved if the document is
actually relevant ($j$) or if the document is not relevant ($r(i,j) - j$) and is not
selected by the searcher for translation ($1 - p_f$).⁶ In practice, we expect one or
the other term to dominate this measure. When the machine translation that is
already being produced for use in the interface will suffice for the ultimate use
of any document, $k \approx 1$, so:



$$C \approx p_r \cdot \mathrm{MAP}$$



By contrast, when human translation is needed to achieve adequate fluency for 
the intended use, we would expect $k \approx 0$, making the second term dominant:

$$C \approx 1 - p_f (1 - \mathrm{MAP})$$

In either case, it is clear that maximizing MAP is desirable. When machine 
translation can adequately support the intended use of the documents, the factor 
that captures the searcher’s contribution to the retrieval process is $p_r$ (which
should be as large as possible). By contrast, when human translation is necessary,
the factor that captures the searcher’s contribution is $p_f$ (which should be as
small as possible). This analysis suggests three possible goals for an evaluation 
campaign: 

MAP. This has been the traditional focus of the CLIR evaluations at TREC, 
NTCIR and CLEF. Improvements in MAP can benefit a broad range of 
applications, but with 70-85% of monolingual MAP now being routinely 
reported in the CLIR literature, shifting some of the focus to other factors 
would be appropriate. 

⁵ The linear combination oversimplifies the situation somewhat, and is best thought
of here as a presentation device rather than as an accurate model of value. 

⁶ For notational simplicity, $p_r$ and $p_f$ have been treated as if they are independent of
$i$ and $j$.
$p_r$. A focus on $p_r$ is appropriate when the cost of finding documents dominates
the total cost, as would be the case when present fully automatic machine
translation technology produces sufficiently fluent translations.

$p_f$. A focus on $p_f$ is appropriate when the cost of obtaining translations that
are suitable for the intended use dominates the total cost, as would be the
case when substantial human involvement in the translation process is required.
Although it may appear that $p_f = 0$ could be achieved by simply
never commissioning a translation, such a strategy would be counterproductive
since no relevant documents would ever be translated. The searcher’s
goal in this case must therefore be to achieve an adequate value for $p_r$ while
minimizing $p_f$.

The second and third of these goals seem equally attractive, since both model 
realistic applications. The next section explores the design of an evaluation 
framework that would be sufficiently flexible to accommodate either focus. 
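
The measure $C$ introduced in this section can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, every relevant document is assumed to appear somewhere in the ranked list, and $p_r$ and $p_f$ are treated as constants. It evaluates the expectation form term by term and compares it with the closed form $k \cdot p_r \cdot \mathrm{MAP} + (1-k)(1 - p_f(1 - \mathrm{MAP}))$.

```python
def mean_of(values):
    values = list(values)
    return sum(values) / len(values)


def map_from_ranks(ranks_per_query):
    """MAP computed from the ranks r(i, j) of the relevant documents."""
    return mean_of(
        mean_of(j / r for j, r in enumerate(sorted(ranks), start=1))
        for ranks in ranks_per_query
    )


def effectiveness_by_expectation(ranks_per_query, k, p_r, p_f):
    """Evaluate C directly from its two nested expectations."""
    per_query = []
    for ranks in ranks_per_query:
        terms = []
        for j, r in enumerate(sorted(ranks), start=1):
            examined = p_r * j / r                      # first term
            translated = (j + (1 - p_f) * (r - j)) / r  # second term
            terms.append(k * examined + (1 - k) * translated)
        per_query.append(mean_of(terms))
    return mean_of(per_query)


def effectiveness_closed_form(map_value, k, p_r, p_f):
    """Closed form: k * p_r * MAP + (1 - k) * (1 - p_f * (1 - MAP))."""
    return k * p_r * map_value + (1 - k) * (1 - p_f * (1 - map_value))


if __name__ == "__main__":
    ranks = [[1, 3], [2]]  # illustrative ranks of relevant documents per query
    k, p_r, p_f = 0.3, 0.8, 0.2
    print(effectiveness_by_expectation(ranks, k, p_r, p_f))
    print(effectiveness_closed_form(map_from_ranks(ranks), k, p_r, p_f))
```

The two values agree because $(j + (1 - p_f)(r - j))/r = 1 - p_f(1 - j/r)$, so taking expectations replaces $j/r$ by MAP.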

4 Evaluating Document Selection 

Although there has not yet been any coordinated effort to evaluate cross-language 
document selection, we are aware of three reported user study results that have 
explored aspects of the problem. In one, Oard and Resnik adopted a classification
paradigm to evaluate browsing effectiveness in cross-language applications,
finding that a simple gloss translation approach allowed users to outperform a
Naive Bayes classifier. In the second, Ogden et al. evaluated a language-independent
thumbnail representation for the TREC-7 interactive track, finding
that the use of thumbnail representations resulted in even better instance recall
at 20 documents than was achieved using English document titles. Finally,
Oard et al. described an experiment design at TREC-9 in which documents
judged by the searcher as relevant were moved higher in the ranked list and
documents judged as not relevant were moved lower. They reported that the
results of a small pilot study were inconclusive. All three of these evaluation
approaches reflect the effect of $p_r$ and $p_f$ in a single measure, but they each exploit
an existing evaluation paradigm that limits the degree of insight that can be 
obtained. Four questions must be considered if we are to evaluate an interactive 
component of a cross-language retrieval system in a way that reflects a vision of 
how that system might actually be used: 

— What process do we wish to model? 

— What independent variable(s) (causes) do we wish to consider? 

— What dependent variable(s) (effects) do we wish to understand? 

— How should the measure(s) of effectiveness be computed?

Two processes have been modeled in the Text Retrieval Conference (TREC) 
interactive track evaluations. In TREC-5, -6, -7 and -8, subjects were asked to 
identify different instances of a topic (e.g., different countries that import Cuban 
sugar). This represents a shift in focus away from topical relevance and towards 
what is often called “situational relevance.” In the situational relevance framework,
the value of a document to a searcher depends in part on whether the
searcher has already learned the information contained in that document. In 
the TREC interactive track, subjects were not rewarded for finding additional 
documents on the same aspect of a topic. The TREC-9 interactive track mod- 
eled a related process in which searchers were required to synthesize answers to 
questions based on the information in multiple documents. 

Moving away from topical relevance makes sense in the context of mono- 
lingual retrieval because the searcher’s ability to assess the topical relevance of 
documents by reading them is already well understood (cf. |IS|). Such is not
the case in cross-language applications, where translation quality can have a 
substantial impact on the searcher’s ability to assess the topical relevance. An 
initial focus on a process based on topical relevance can thus be both informa- 
tive and economical (since the same relevance judgments used to evaluate fully 
automatic systems can be used). 

The next two questions deal with cause and effect. The complexity of an 
evaluation is roughly proportional to the product of the cardinality of the inde- 
pendent variables, so it is desirable to limit the choice of independent variables 
as much as possible. In the TREC, NTCIR and CLEF evaluations of the fully 
automatic components of CLIR systems, the independent variable has been the 
retrieval system design and the dependent variable has been retrieval system 
effectiveness. Since we are interested in the interactive components of a cross- 
language retrieval system, it would be natural to hold the fully automatic com- 
ponents of the retrieval system design constant and vary the user interface design 
as the independent variable. This could be done by running the automatic com- 
ponent once and then using the same ranked list with alternate user interface 
designs. Although it might ultimately be important to also consider other de- 
pendent variables (e.g., response time), retrieval effectiveness is an appropriate 
initial focus. After all, it would make little sense to deploy a fast, but ineffective, 
retrieval system. 

The final question, the computation of measure(s) of effectiveness, actually 
includes two subquestions: 

— What measure(s) would provide the best insight into aspects of effectiveness 
that would be meaningful to a searcher? 

— How can any effects that could potentially confound the estimate of the 
measure(s) be minimized?

When a single-valued measure can be found that reflects task performance with
adequate fidelity, such a measure is typically preferred because the effect of 
alternative approaches can be easily expressed as the difference in the value of 
that measure. Mean average precision is such a measure for ranked retrieval 
systems. Use of a ranked retrieval measure seems inappropriate for interactive 
evaluations, however, since we have modeled the searcher’s goal as selecting 
(rather than ranking) relevant documents. 
One commonly used single-valued measure for set-based retrieval systems is
van Rijsbergen’s F measure, which is a weighted harmonic mean of recall and 
precision: 



$$F = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1-\alpha}{R}}, \qquad \alpha = \frac{1}{\beta^2 + 1}$$



where $P$ is the precision (the fraction of the selected documents that are relevant),
$R$ is the recall (the fraction of the relevant documents that are selected),
and $\beta$ is the ratio of relative importance that the searcher ascribes to recall and
precision H31. It is often assumed that $\beta = 1$ (which results in the unweighted
harmonic mean), but the value for $\beta$ in an interactive CLIR evaluation should be
selected based on the balance between $p_r$ and $p_f$ that is appropriate
for the process being modeled.

Another possibility would be to adopt an additive utility function similar to 
that used for set-based retrieval evaluation in the TREC filtering track and the 
Topic Detection and Tracking (TDT) evaluation: 

$$C_{a,b} = N_r - (a \cdot N_f + b \cdot N_m)$$

where $N_r$ is the number of relevant documents that are selected by the user,
$N_f$ is the number of false alarms (non-relevant documents that are incorrectly
selected by the user), $N_m$ is the number of misses (relevant documents that are
incorrectly rejected by the user), and $a$ and $b$ are weights that reflect the costs of
false alarms and misses, respectively, relative to the value of correctly selecting a
relevant document.
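
Both set-based measures can be computed from the same three counts. The following Python sketch is a minimal illustration, assuming that a searcher's selections and the relevance judgments are available as sets of document identifiers; the function names are hypothetical, and the F measure is written in the standard van Rijsbergen form with $\alpha = 1/(\beta^2 + 1)$.

```python
def selection_counts(selected, relevant):
    """Return (N_r, N_f, N_m): correct selections, false alarms, misses."""
    selected, relevant = set(selected), set(relevant)
    n_r = len(selected & relevant)
    n_f = len(selected - relevant)
    n_m = len(relevant - selected)
    return n_r, n_f, n_m


def f_measure(selected, relevant, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    n_r, n_f, n_m = selection_counts(selected, relevant)
    if n_r == 0:
        return 0.0  # convention chosen here: no correct selections scores zero
    precision = n_r / (n_r + n_f)
    recall = n_r / (n_r + n_m)
    alpha = 1.0 / (beta ** 2 + 1.0)
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


def additive_utility(selected, relevant, a, b):
    """C_{a,b} = N_r - (a * N_f + b * N_m)."""
    n_r, n_f, n_m = selection_counts(selected, relevant)
    return n_r - (a * n_f + b * n_m)


if __name__ == "__main__":
    selected = {"d1", "d2", "d3"}        # documents chosen by the searcher
    relevant = {"d1", "d3", "d4", "d5"}  # documents judged relevant
    print(selection_counts(selected, relevant))
    print(f_measure(selected, relevant))
    print(additive_utility(selected, relevant, a=0.5, b=0.25))
```

In this example precision is 2/3 and recall is 1/2. Raising $\beta$ shifts the F measure toward recall, while the weights $a$ and $b$ let the utility function penalize false alarms and misses differently.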

Regardless of which measure is chosen, several factors must be considered in 
any study design: 

— A system effect, which is what we seek to measure. 

— A topic effect in which some topics may be “easier” than others. This could 
result, for example, from the close association of an unambiguous term (a 
proper name, perhaps) with one topic, while another might only be found 
using combinations of terms that each have several possible translations. 

— A topic-system interaction, in which the effect of a topic compared to some 
other topic varies depending on the system. This could result, for example, 
if one system was unable to translate certain terms that were important to 
judging the relevance of a particular topic. 

— A searcher effect, in which one searcher may make relevance judgments more 
conservatively than another. 

— A searcher-topic interaction, in which the effect of a searcher compared to 
some other searcher varies depending on the topic. This could result, for 
example, from a searcher having expert knowledge on some topic that
other searchers must judge based on a less detailed understanding. 
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— A searcher-system interaction, in which the effect of a searcher compared to 
some other searcher varies depending on the system. This could result, for 
example, from one searcher having better language skills, which might be 
more important when using one system than another. 

— A searcher-topic-system interaction. 

In the CLEF evaluation for fully automatic CLIR, the topic has been modeled 
as an additive effect and accommodated by taking the mean of the uninterpolated 
average precision over a set of (hopefully) representative topics. In the 
TREC interactive track, the topic and searcher have been modeled as additive 
effects, and accommodated using a 2 x 2 Latin square experiment design. Four 
searchers were given 20 minutes to search for documents on each of six topics in 
the TREC-5 and TREC-6 interactive track evaluations [10,11]. Eight searchers 
were given 15 minutes to search for documents on each of eight topics in the 
TREC-7 interactive track evaluation [12]. Twelve searchers were given 20 minutes 
to search for documents on each of six topics in the TREC-8 interactive 
track evaluation [4]. In each case, the Latin square was replicated as many times 
as the number of searchers and topics allowed in order to minimize the effect of 
the multi-factor interactions. Cross-site comparisons proved to be uninformative, 
and were dropped after TREC-6. The trend towards increasing the number 
of searchers reflects the difficulty of discerning statistically significant differences 
with a limited number of searchers and topics. User studies require a substantial 
investment: each participant in the TREC-8 interactive track was required 
to obtain the services of twelve human subjects with appropriate qualifications 
(e.g., no prior experience with either system) for about half a day each and to 
develop two variants of their interactive retrieval system. 
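
The replicated 2 x 2 Latin square can be illustrated with a short sketch. It is a simplified reading of the design and not the TREC protocol itself: topics are split into two blocks, searchers are taken in pairs, and within each pair the two systems are swapped between the blocks so that every searcher uses both systems and every topic is searched with both systems equally often. The function name and the searcher, topic and system labels are hypothetical, and the sketch does not rotate the order in which topics are presented.

```python
def latin_square_plan(searchers, topics, systems=("A", "B")):
    """Map each (searcher, topic) pair to a system using replicated 2 x 2 squares."""
    if len(systems) != 2:
        raise ValueError("this sketch compares exactly two systems")
    if len(searchers) % 2 or len(topics) % 2:
        raise ValueError("searchers and topics must both come in even numbers")
    half = len(topics) // 2
    blocks = (topics[:half], topics[half:])
    plan = {}
    for i, searcher in enumerate(searchers):
        # Consecutive searchers form one square: the second searcher in each
        # pair sees the two systems in the opposite block assignment.
        first = i % 2
        order = (systems[first], systems[1 - first])
        for block, system in zip(blocks, order):
            for topic in block:
                plan[(searcher, topic)] = system
    return plan


# Four searchers and six topics give two replications of the square.
searchers = ["s1", "s2", "s3", "s4"]
topics = ["t1", "t2", "t3", "t4", "t5", "t6"]
plan = latin_square_plan(searchers, topics)
```

Adding searchers two at a time adds further replications, which is what makes it possible to estimate the system effect while treating the searcher and topic effects as additive.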



5 An Interactive CLIR Track for CLEF? 

The foregoing discussion suggests that it would be both interesting and practical 
to explore interactive CLIR at one of the major CLIR evaluations (TREC, CLEF, 
and/or NTCIR). In thinking through what such an evaluation might look like 
in the context of CLEF, the following points should be considered: 

Experiment Design. The replicated Latin square design seems like a good 
choice because there is a wealth of experience to draw upon from TREC. 
Starting at a small scale, perhaps with four searchers and six topics, would 
help to minimize barriers to entry, an important factor in any new evaluation. 
Options could be provided for teams that wished to add additional searchers 
in groups of 4. Allowing searchers 20 minutes per topic is probably wise, 
since that has emerged as the standard practice in the TREC interactive 
track. The topic selection procedure will need to be considered carefully, 
since results for relatively broad and relatively narrow topics might differ. 

Evaluation Measure. There would be a high payoff to retaining an initial focus 
on topical relevance, at least for the first evaluation, since documents 
found by interactive searchers could simply be added to the relevance judgment 
pools for the main (fully automatic) evaluation. The F_β measure might 
be a good choice, although further analysis would be needed to determine 
an appropriate value for β once the relative importance of p_r and p_f is decided, 
and other measures should also be explored. The instructions given 
to the subjects will also be an important factor in minimizing a potential 
additional effect from misunderstanding the task. Subjects without formal 
training in relevance assessment sometimes confound the concept of topical 
relevance (the relationship between topic and document that is the basis for 
evaluation in CLEF) with the concept of situational relevance (a relationship 
between a searcher’s purpose and a document that captures the searcher’s 
assessment of the suitability of the document for that [possibly unstated] 
purpose). Providing clear instructions and adequate time for training will be 
essential if relevance assessments are to be obtained from subjects that are 
comparable to the ground truth relevance judgments produced by the CLEF 
assessors. 

Document Language. It would be desirable to agree on a common document 
collection because it is well known that the performance of retrieval sys- 
tems varies markedly across collections. That may be impractical in a place 
as linguistically diverse as Europe, however, since the choice of any single 
document language would make it difficult for teams from countries where 
that language is widely spoken to find cross-language searchers. For the first 
interactive cross-language evaluation, it might therefore make more sense to 
allow the use of documents in whichever language(s) would be appropriate 
for the searchers and for the translation resources that can be obtained. 

Retrieval System. Interactive cross-language retrieval evaluations should focus 
on the interactive components of the system, so to the extent possible 
the fully automatic components should be held constant. If the participants 
agree to focus on interactive document selection, the use of a common ranked 
list with different interfaces would seem to be appropriate. Providing a stan- 
dard ranked list of documents for each topic would help reduce barriers to 
entry by making it possible for a team to focus exclusively on user interface 
issues if that is their desire. Since cross-site comparisons were found to be 
uninformative in the TREC interactive track, it is probably not necessary 
to require the use of these standard ranked lists by every team. 

Two non-technical factors will also be important to the success of an inter- 
active cross-language retrieval track within a broader evaluation campaign. The 
first, an obvious one, is that coordinating the track will require some effort. A 
number of experiment design issues must be decided and communicated, results 
assembled, reports written, etc. The second, perhaps even more important, is 
that the track would benefit tremendously from the participation of one or more 
teams that already have experience in both the TREC interactive track and at 
least one cross-language retrieval evaluation. Several teams with this sort of ex- 
perience exist, including Sheffield University in the U.K., the IBM Thomas J. 
Watson Research Center, New Mexico State University, the University of California 
at Berkeley and the University of Massachusetts at Amherst in the USA, 
and the Royal Melbourne Institute of Technology in Australia. With this depth 
of experience, the critical mass needed to jump start the evaluation process may 
indeed be available. 



6 Forming a Research Community 

CLEF is an example of what is known as an evaluation-driven research paradigm, 
in which participants agree on a common problem, a common model of that prob- 
lem, and a common set of performance measures. Although evaluation-driven re- 
search paradigms risk the sort of local optimization that can result from choice 
of a single perspective, a key strength of the approach is that it can foster rapid 
progress by bringing together researchers that might not otherwise have occa- 
sion to collaborate, to work in a common framework on a common problem. It 
is thus natural to ask about the nature of the research community that 
would potentially participate in an interactive CLIR evaluation. One measure 
of the interest in the field is that a workshop on this topic at the University of 
Maryland attracted eighteen participants from nine organizations and included 
five demonstrations of working prototype systems [1]. Another promising factor 
is the existence of three complementary literatures that offer potential sources 
of additional insights into how the cross-language document selection task might 
be supported: machine translation, abstracting/text summarization, and human- 
computer interaction. 

Machine translation has an extensive research heritage, although evaluation 
of translation quality in a general context has proven to be a difficult problem. 
Recently, Taylor and White inventoried the tasks that intelligence analysts per- 
form using translated materials and found two (discarding irrelevant documents 
and finding documents of interest) that correspond exactly with cross-language 
document selection [13]. Their ultimate goal is to identify measurable characteristics 
of translated documents that result in improved task performance. If that 
line of inquiry proves productive, the results could help to inform the design of 
the machine translation component of document selection interfaces. 

The second complementary literature is actually a pair of literatures, alter- 
nately known as abstracting (a term most closely aligned with the bibliographic 
services industry) and text summarization (a term most closely aligned with 
research on computational linguistics). Bibliographic services that process doc- 
uments in many languages often produce abstracts in English, regardless of the 
document language. Extensive standards already exist for the preparation of abstracts 
for certain types of documents (e.g., Z39.14 for reports of experimental 
work and descriptive or discursive studies [6]), and there may be knowledge in 
those standards that could easily be brought to bear on the parts of the cross- 
language document selection interface that involve summarization. There is also 
some interest in the text summarization community in cross-language text sum- 
marization, and progress on that problem might find direct application in CLIR 
applications. One caveat in both cases is that, as with translation, the quality of 
a summary can only be evaluated with some purpose in mind. Document selection 
requires what is known in abstracting as an “indicative abstract.” Research 
on “informative” or “descriptive” abstracts may not transfer as directly. 

Finally, the obvious third complementary literature is human-computer in- 
teraction. Several techniques are known for facilitating document selection in 
monolingual applications. For example, the “keyword in context” technique 
is commonly used in document summaries provided by Web search engines — 
highlighting query terms and showing them in the context of their surrounding 
terms. Another example is the “show best passage” feature that some text retrieval 
systems (e.g., Inquery) provide. Extending ideas like these to work across 
languages is an obvious starting point. Along the way, new ideas may come 
to light. For example, Davis and Ogden allowed searchers to drill down during 
cross-language document selection by clicking on a possibly mistranslated word 
to see a list of alternative translations [2]. 
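
To make the “keyword in context” idea concrete, the following is a minimal monolingual sketch: it returns a short window of words around every occurrence of a query term, with the matching word marked in square brackets. The function name, the bracket notation and the window size are illustrative choices and are not taken from any particular search engine. A cross-language variant would apply the same windowing to translated text, or translate each snippet after extraction.

```python
import re


def keyword_in_context(text, query_terms, window=4):
    """Return one snippet per query-term occurrence, with `window` words per side."""
    words = text.split()
    targets = {term.lower() for term in query_terms}
    snippets = []
    for i, word in enumerate(words):
        # Strip punctuation before comparing, so "deadline." matches "deadline".
        if re.sub(r"\W+", "", word).lower() in targets:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            snippets.append(" ".join(left + ["[" + word + "]"] + right))
    return snippets


message = "The main thing that we are going to be discussing is the upcoming deadline."
snippets = keyword_in_context(message, ["deadline"], window=2)
```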

Drawing these diverse research communities together with the existing CLIR 
community will be a challenge, but there is good reason to believe that each 
would find an interactive CLIR evaluation to be an attractive venue. The design 
of tractable evaluation paradigms has been a key challenge for both machine 
translation and text summarization, so a well designed evaluation framework 
would naturally attract interest from those communities. Human-computer in- 
teraction research is an enabling technology rather than an end-user application, 
so that community would likely find the articulation of an important problem 
that is clearly dependent on user interaction to be of interest. As we have seen 
in the CLIR and TREC interactive track evaluations, the majority of the partic- 
ipants in any interactive CLIR evaluation will likely self-identify as information 
retrieval researchers. But experience has shown that the boundaries become 
fuzzier over time, with significant cross-citation between complementary litera- 
tures, as the community adapts to new challenges by integrating new techniques. 
This community-building effect is perhaps one of the most important legacies of 
any evaluation campaign. 

7 Conclusion 

Reviewing results from the TREC interactive track, Hersh and Over noted that 
“users showed little difference across systems, many of which contained features 
shown to be effective in non-interactive experiments in the past” [4]. Pursuing 
this insight, Hersh et al. found that an 81% relative improvement in mean average 
precision resulted in only a small (18%) and not statistically significant 
improvement in instance recall [5]. If this were also true of CLIR, perhaps we 
should stop working on the problem now. The best CLIR systems already report 
mean average precision values above 75% of that achieved by their monolingual 
counterparts, so there appears to be little room for further improvement in the 
fully automated components of the system. But the results achieved by Hersh 
et al. most likely depend at least in part on the searcher’s ability to read the 
documents that are presented by the retrieval system, and it is easy to imagine 
CLIR applications in which that would not be possible without some form of automated 
translation. If we are to make rational decisions about where to invest 
our research effort, we must begin to understand CLIR as an interactive process. 
Beginning with a focus on the cross-language document selection process seems 
to be appropriate, both for the insight that it can offer and for the tractability 
of the evaluation. 

We somewhat euphemistically refer to our globally interconnected informa- 
tion infrastructure as the World-Wide Web. At present, however, it is far less 
than that. For someone who only reads English, it is presently the English-Wide 
Web. A reader of only Chinese sees only the Chinese-Wide Web. We are still 
faced with two problems that have been with us since the Tower of Babel: how 
to find the documents that we need, and how to use the documents that we 
find. The global series of CLIR evaluations — TREC, NTCIR and CLEF — have 
started us on the path of answering the first question. It is time to take the second 
step along that path, and begin to ask how searchers and machines can work 
together to find documents in languages that the searcher cannot read, and to do 
so better than machines can alone. 



Acknowledgments 

The author is grateful to the participants in the June 2000 workshop on interac- 
tive cross-language retrieval, and especially to Bill Ogden, Gina Levow and Jian- 
qiang Wang, for stimulating discussions on this subject, and to Rebecca Hwa, 
Paul Over and Bill Hersh for their helpful comments on earlier versions of this 
paper. This work was supported in part by DARPA contract N6600197C8540 
and DARPA cooperative agreement N660010028910. 

References 

1. Workshop on interactive searching in foreign-language collections (2000) 
http://www.clis.umd.edu/conferences/hcil00/ 

2. Davis, M., Ogden, W. C.: Quilt: Implementing a large-scale cross-language text 
retrieval system. In Proceedings of the 20th International ACM SIGIR Conference 
on Research and Development in Information Retrieval (1997) 

3. Hearst, M. A.: User interfaces and visualization. In Baeza-Yates, R., Ribeiro-Neto, 
B., Modern Information Retrieval, chapter 10. Addison Wesley, New York (1999) 
http://www.sims.berkeley.edu/~hearst/irbook/chapters/chap10.html. 

4. Hersh, W., Over, P.: TREC-8 interactive track report. In The Eighth Text RE- 
trieval Conference (TREC-8) (1999) 57-64 http://trec.nist.gov. 

5. Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., Olson, D.: 
Do batch and user evaluations give the same results? In Proceedings of the 23rd 
Annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval (2000) 17-24 

6. National Information Standards Organization: Guidelines for Abstracts 
(ANSI/NISO Z39.14-1997) NISO Press (1997) 







7. Oard, D. W., Levow, G.-A., Cabezas, C. I.: TREC-9 experiments at Maryland: 
Interactive CLIR. In The Ninth Text Retrieval Conference (TREC-9) (2000) To 
appear, http://trec.nist.gov. 

8. Oard, D. W., Resnik, P.: Support for interactive document selection in cross-language 
information retrieval. Information Processing and Management 35(3) 
(1999) 363-379 

9. Ogden, W., Cowie, J., Davis, M., Ludovik, E., Molina-Salgado, H., Shin, 
H.: Getting information from documents you cannot read: An interactive 
cross-language text retrieval and summarization system. In Joint ACM 
DL/SIGIR Workshop on Multilingual Information Discovery and Access (1999) 
http://www.clis.umd.edu/conferences/midas.html. 

10. Over, P.: TREC-5 interactive track report. In The Fifth Text REtrieval Conference 
(TREC-5) (1996) 29-56 http://trec.nist.gov. 

11. Over, P.: TREC-6 interactive track report. In The Sixth Text REtrieval Conference 
(TREC-6) (1997) 73-82 http://trec.nist.gov. 

12. Over, P.: TREC-7 interactive track report. In The Seventh Text REtrieval Con- 
ference (TREC-7) (1998) 65-71 http://trec.nist.gov. 

13. Taylor, K., White, J.: Predicting what MT is good for: User judgments and task 
performance. In Third Conference of the Association for Machine Translation in 
the Americas (1998) 364-373 Lecture Notes in Artificial Intelligence 1529. 

14. van Rijsbergen, C. J.: Information Retrieval. Butterworths, London, second edition 
(1979) 

15. Wilbur, W. J.: A comparison of group and individual performance among subject 
experts and untrained workers at the document retrieval task. Journal of the 
American Society for Information Science, 49(6) (1998) 517-529 




New Challenges for Cross-Language Information 
Retrieval: Multimedia Data and the 
User Experience 



Gareth J.F. Jones 

Department of Computer Science, University of Exeter, EX4 4PT, U.K. 
G.J.F.Jones@exeter.ac.uk, 
http://www.dcs.ex.ac.uk/~gareth 



Abstract. Evaluation exercises in Cross-Language Information Retrieval 
(CLIR) have so far been limited to the location of potentially relevant 
documents from within electronic text collections. Although there has 
been considerable progress in recent years, much further research is required 
in CLIR, and clearly one focus of future research must continue to 
address fundamental retrieval issues. However, CLIR is now sufficiently 
mature to broaden the investigation to consider some new challenges. 
Two interesting further areas of investigation are the user experience of 
accessing information from retrieved documents in CLIR, and the ex- 
tension of existing research to cross-language methods for multimedia 
retrieval. 



1 Introduction 

The rapid expansion in research into Cross-Language Information Retrieval 
(CLIR) in recent years has produced a wide variety of work focusing on retrieval 
from different language groupings and using varying translation techniques. For- 
mal evaluation exercises in CLIR have so far concentrated exclusively on cross- 
language and multilingual retrieval for electronic text collections. This work only 
reflects one aspect of the complete Information Access (IA) process required for 
Cross-Language Information Access (CLIA). In this paper the complete process 
of IA is taken to involve a number of processes: user description of information 
need in the form of a search request, identification of potentially relevant 
documents by a retrieval engine, relevance judgement by the user of retrieved 
documents, and subsequent user extraction of information from individual re- 
trieved documents. These issues all become more complicated in cross-language 
and multilingual environments, where in particular the relevance assessment and 
information extraction stages have received very little attention. Although there 
have been a number of individual studies exploring IA techniques for retrieved 
documents, e.g. P I2| PI, thus far there have been no common tasks on which 
IA techniques can be compared and contrasted for either monolingual or multilingual 
data. In addition, CLIR and other related retrieval technologies are 
now sufficiently mature to begin exploration of the more challenging task of 
cross-language retrieval from multimedia data. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 72WTH 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 

This paper explores issues in CLIA evaluation in more detail and reviews 
relevant existing work in multimedia retrieval. Section 2 focuses on CLIA from 
the user’s perspective, Section 3 reviews current research in Multimedia Information 
Retrieval (MIR), Section 4 considers the challenges of Cross-Language 
Multimedia Information Retrieval (CLMIR), and Section 5 gives some concluding 
thoughts. 

2 The User in Cross-Language Information Access 

The user is actively involved in most stages of the Information Access process. 
User activities of course include forming search requests, but also the judgement 
of retrieved document relevance and the extraction of information from individ- 
ual retrieved documents. Extending evaluation exercises beyond assessing the 
effectiveness of document retrieval to these other stages of IA is important in 
order to assess the usefulness of techniques which are designed to assist the user 
with relevance judgement and information extraction. This section reviews some 
existing relevant work in these areas. 

2.1 Relevance Judgement 

For monolingual text retrieval it is typically assumed that the user can rapidly 
make relevance judgements about retrieved documents by skimming a portion of 
the document. Users are typically provided with the title and first few sentences 
of the document to decide its possible relevance to their request. A more sophis- 
ticated method of providing summary information for relevance judgement is 
suggested by query-biased summaries as described in P| . Another approach sug- 
gested for this situation uses a graphical representation of the document contents 
with respect to the query terms to show the level of matching between a query 
and an individual document, and the distribution of search terms within the 
document. There are several examples of graphical representations which can 
be used in this way including Thumbnail document images Q and document 
TileBars 0. Limited experiments have suggested that users are able to make 
relevance judgements with some degree of reliability based only on a graphi- 
cal representation of this form without actually accessing linguistic information 
within the document. 



2.2 Cross-Language Relevance Assessment 

Assessment of relevance can be more complicated for cross-language retrieval. 
Clearly this depends on the users and their level of fluency in the document 
language. If the users are fluent in the document language, perhaps they don’t 
really need CLIR at all. However, to keep matters simple let’s consider here 
only the situation of the user knowing little or nothing about the document 
language, e.g. a typical English reader with Chinese documents or a Japanese 
reader with German documents. In this situation, even selecting a document 
as potentially relevant is impossible using the document itself in its raw form. 
Existing work in this area has begun to discuss ideas such as using Machine 
Translation (MT) techniques for assessing relevance in ranked retrieval lists, 
e.g. augmenting the summaries which typically appear in these ranked lists with 
corresponding translations into the request language. If the user finds a document 
potentially interesting it is retrieved in the usual way and then fully translated 
to the request language using MT prior to presentation to the user |^. So far 
studies have only been suggestive that these methods are useful rather than 
having been formally evaluated as such. An alternative approach is suggested in 
^ and 1^ where users were presented with simple gloss translations of returned 
documents. In gloss translations terms in the documents are replaced by one or 
more likely translations of the word. It is generally observed that users are able 
to disambiguate alternative translations to select the contextually correct one. 
The graphical relevance assessment methods outlined in the previous section can 
also be applied to CLIR; one example of this approach is given in PJ. At present, 
there do not appear to have been any comparative evaluations of the relative 
effectiveness of these various relevance judgement strategies. 
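
The gloss translation idea described above can be sketched in a few lines. The sketch below is an illustration under stated assumptions and not the method of any of the cited systems: the bilingual dictionary is a hypothetical toy lexicon, the function name and the brace notation for alternatives are invented for the example, and real systems would also need tokenisation, morphological normalisation and some ranking of the candidate translations.

```python
def gloss_translate(tokens, dictionary, max_alternatives=2):
    """Replace each token by its most likely translation(s), word by word."""
    glossed = []
    for token in tokens:
        # The dictionary is assumed to list translations in decreasing likelihood.
        translations = dictionary.get(token.lower(), [])[:max_alternatives]
        if not translations:
            glossed.append(token)              # unknown terms pass through unchanged
        elif len(translations) == 1:
            glossed.append(translations[0])
        else:
            # Show the alternatives and leave disambiguation to the reader.
            glossed.append("{" + "|".join(translations) + "}")
    return " ".join(glossed)


# Hypothetical German-to-English toy lexicon.
toy_lexicon = {
    "bank": ["bank", "bench"],
    "am": ["at the"],
    "fluss": ["river"],
}
gloss = gloss_translate("Die Bank am Fluss".split(), toy_lexicon)
```

Leaving several alternatives visible reflects the observation that readers can usually pick the contextually correct translation themselves.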



2.3 Information Access from Retrieved Documents 

Evaluation measures in CLIR have focussed on the traditional metrics of recall 
and precision. While obviously useful, these are rather limited instruments in 
what they tell us generally about the usefulness of a retrieval system to the user, 
particularly for systems involving cross-language and multimedia retrieval. 

As suggested in the last section, after the user has selected a potentially 
relevant document it can be translated into the request language using an MT 
system prior to being presented to the user. The automatic translation process is 
imperfect: stylistic problems will often be introduced, and factual errors may be 
introduced as well. With respect to translation of informative documents, such 
as mail messages or news reports, factual accuracy is probably more important 
than style. For example, if the output of an MT system gets the day or time of 
a meeting wrong, considerable inconvenience could result, regardless of how well 
the prose might be constructed. A useful CLIA-motivated investigation here may 
be to explore whether the translated version of the document actually contains 
the data required to satisfy the user’s information need. The system may have 
successfully retrieved a relevant document, and this will show up positively in 
precision and recall measurements, but a further important question is: can the 
information in the document which causes it to be relevant be made available 
to the user in a form that they can understand, e.g. in a different language? 
Essentially this is looking for a quantitative measure of the reliability of the 
translation process which could be directly bound into a retrieval task or could be 
connected to a task exploring the answering of an information need by particular 
documents. A further evaluation task could be to look at interactive information 
seeking in CLIR; this would allow exploration of issues such as the possible 
involvement of the user in the translation process. 

We might further want to look at summarisation in CLIR, both for document 
browsing and accessing facts. Summarisation has so far not been included in the 
TREC evaluation programme, but this is a topic of increasing importance as 
recognised in the Japanese NTCIR evaluation programme where it has been 
included as a formal task in the 2nd NTCIR workshop programme. 

Recent TREC tracks have looked at the user’s interaction with data and 
query development in interactive retrieval, and more recently a track looking at 
Question-Answering has been introduced. These are currently only running 
for monolingual text retrieval, but could potentially be extended to CLIR, both 
separately and possibly in combination. 

3 Multimedia Information Retrieval 

Existing research in CLIR has focussed almost exclusively on text retrieval. How- 
ever, documents may originate in various different media, e.g. typed text, spoken 
data, document images from paper or video data. Research in Multimedia Infor- 
mation Retrieval (MIR) has focussed almost exclusively on monolingual retrieval. 
However, this work is itself now sufficiently mature to begin exploration into 
systems capable of effective cross-language retrieval tasks. This section briefly 
reviews research in spoken document and scanned document retrieval. The fol- 
lowing section then considers the extension of this work to cross-language tasks. 



3.1 Spoken Document Retrieval 

Spoken Document Retrieval (SDR) has been an active area of research for around 
10 years. The first work was carried out by Rose in the early 1990s. Rose 
used a fixed-vocabulary word spotter to assign spoken documents to one of ten 
predefined categories. The first research to address a more standard ad hoc information 
retrieval task was carried out at ETH Zurich by Glavitsch and Schauble. 
This research explored the use of a set of predefined subword units for 
open-vocabulary retrieval of spoken data. Two other early research projects 
were conducted at Cambridge University. Both of these explored the use of a 
subword phone lattice and large vocabulary speech recognition for document indexing. 
One project by James investigated the retrieval of BBC radio news 
broadcasts and the larger Video Mail Retrieval (VMR) project focussed on the 
retrieval of video mail messages. The first large scale digital video library 
project to explore SDR was Informedia at Carnegie Mellon University. This ongoing 
project began by focusing on automated retrieval of broadcast television 
news. 

Figure 1 shows a manual transcription of a spoken message from the VMR1b 
collection used in the VMR project. This transcription has been constructed 
carefully to include disfluency markers such as [um], [ah] and [loud_breath], 
as well as [pause] markers. The punctuation is inserted here by inference from 
listening to the prosody of the speech, and is added to aid reading of the messages. 
Inserting these markers automatically as part of a transcription process 
would be a very challenging task. Applying the recorded audio file of the example 
message to the 20K Large Vocabulary Speech Recognition (LVR) system used in 
the VMR project produces the transcription shown in Fig. 2. Automated transcription 
of all messages in VMR1b gave retrieval performance of around 80% 
of that achieved with the manual text transcription. 

M524p003: [tongue_click] O K, [pause] I’ve finally arranged a time and 
a date that everyone on the project can make. [loud_breath] [um] 
[tongue_click] The time is ten o’clock and the date is Wednesday the 
twenty fifth, [pause] [ah] Monday and Tuesday were out unfortunately, 
[loud_breath] [ah] if anyone can’t make this time and date please can 
they let me know as soon as possible [pause] and I’ll arrange [pause] the 
meeting for another [loud_breath] mutually acceptable time. [loud_breath] 
The main thing that we’re going to be discussing is the upcoming 
deadline. 

Fig. 1. Example of a manual VMR message transcription with transcriber-inserted 
punctuation. 

M524p003: THE K. R. O. FINALLY ARRANGED A TIME AND A LIGHT OF THE RIVAL 
PRODUCT AND WON’T LEARN THAT TIME THIS TYPE OF CANDIDATE THERE IS ON 
WEDNESDAY THE TWENTY FIFTH WHILE MANAGER OF THE ROAD AND FOR THE OVERALL 
CAR LIKE THIS TIME IN DIRECT CONTROL AND AS SOON AS POSSIBLE AND RANGE 
OF A MEETING FOR A NUCLEAR SUPPORT ON THE MAIN THING WE NEVER DISCUSS THE 
NEWS THAT KIND OF LINE 

Fig. 2. Example of 20K Large Vocabulary Speech Recogniser VMR message 
transcription. 

SDR has featured as a track at the annual TREC conferences for the last 
4 years. This began with a known-item search task in TREC 6 and has 
moved on to progressively more challenging ad hoc tasks in subsequent years. 
The data used in the TREC SDR tasks was broadcast TV and radio news. 
Techniques have been developed to deliver performance levels for spoken data 
very similar to near correct manual transcriptions, and it has been decided that 
the SDR track in its current form will not be run at future TRECs. 

While it appears that SDR is largely solved for retrieval of English language 
broadcast news there are a number of challenges which still remain. These in- 
clude proper investigation and evaluation of SDR for other languages, such as 
European and Asian languages. While many of the techniques developed for 
English SDR may well prove effective for other languages, the features of these 
languages may require enhancement of existing methods or the development of 
new ones. An important point with respect to spoken data is the availability of 
suitable speech recognition systems. One of the results of existing SDR studies 
is the high correlation between speech recognition accuracy and retrieval per- 
formance. An important investigation is to explore how to perform the best 
retrieval for languages where high quality speech recognition resources are not 
currently available. 
Another important issue in SDR is to explore the effectiveness with which 
information can be extracted from retrieved spoken documents. The linear na- 
ture of speech means that accessing information from within individual spoken 
documents is fundamentally slower and more difficult than from text documents 
which can be scanned visually very rapidly. For relevance judgement it is time 
consuming to scan through audio files in order to overview their contents. For 
this reason most SDR systems make use of some form of graphical visualisation 
and often display the automatically transcribed document contents in the user 
interface. Although not previously used in SDR, the graphical methods 
outlined earlier in this paper might be used to assist relevance judgement here. 

Comparing the automatic transcription in Fig. 2 with the manual one shown 
in Fig. 1 it can be seen that there are many mistakes. Some important factual 
information is preserved, but other details are completely lost. This illustrates an 
interesting aspect of speech recognition quality and its evaluation with respect 
to IA. While it has been shown in various studies that it is possible to achieve 
good SDR performance, even with a high number of recognition errors in the 
document transcription, the transcriptions may be ineffective for addressing user 
information needs. For example, although we have the correct day of the meeting 
in the automated transcription of example message M524p003, we have lost the 
time. This highlights the importance of making the maximum use of the original 
data in the IA process and further motivates the use of visualisation in SDR 
interfaces. In addition to displaying the text transcription, these interfaces 
typically allow the user to play back the audio file itself, thus limiting 
the impact of recognition errors on the information extraction process. 



3.2 Document Image Retrieval 

While most contemporary documents are available in online electronic form 
many archives exist only in printed form. This is particularly true of impor- 
tant historical documents, but also many comparatively recent documents even 
if originally produced electronically are now only available in their final printed 
form. In order to automatically retrieve items of this type their content must be 
indexed by scanning and then applying some form of Optical Character Recog- 
nition (OCR) process to transcribe the document contents. 

Research in document image retrieval has covered a number of topics related 
to the identification of relevant documents in various application tasks. However, 
work in actual ad hoc retrieval of documents in response to a user search request 
has been concentrated in a limited number of studies. The most extensive work 
has been carried out over a number of years at the University of Nevada, Las 
Vegas, where retrieval effectiveness for a number of different collections, 
indexing technologies and retrieval methods has been explored. Another study 
of the effect of recognition errors on document image retrieval, focusing on 
Japanese text, has also been reported. Document image retrieval was the focus of 
the Confusion Track run as part of TREC 5. This was a known-item search 
task and a number of participating groups reported results using a variety of 
indexing methods. Techniques explored included indexing using n-gram 
sequences, methods to allow for substitution, deletion or insertion of letters 
in indexing terms, and attempts to correct OCR errors using dictionary-based 
methods. The Confusion Track has not been run at subsequent TREC evaluations, 
where the focus has moved to the SDR task. A detailed review of research in 
document image retrieval is available in the literature. 

Assessing the relevance of retrieved documents could be achieved by presen- 
tation of the first section of the document image to the user. Selected document 
images can then be shown to the user in their entirety. Navigation within docu- 
ments might be achieved using some form of graphical representation of the dis- 
tribution of search terms similar to those developed for SDR interfaces. There 
is little difference between this scenario for document images and retrieval of 
typed electronic text, except that the document contents would only be search- 
able by approximate matching with the output of the OCR system. Occurrences 
of search terms in the displayed document images could be highlighted to assist 
with browsing, but again this can only be based on an approximate match, so 
may be error-prone. 
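The approximate matching mentioned above can be illustrated with character n-grams, one of the indexing techniques explored in the Confusion Track. The following is only a minimal sketch, not the method of any participating system: the function names, the padding character, the trigram size and the 0.5 threshold are all assumptions chosen for illustration.

```python
def char_ngrams(term, n=3):
    """Return the set of character n-grams of a term, padded at both ends."""
    padded = "#" * (n - 1) + term.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def highlight(query_term, ocr_tokens, threshold=0.5):
    """Return positions of OCR tokens that approximately match the query term.

    The threshold is an illustrative assumption; a lower value highlights
    more tokens at the cost of more false matches.
    """
    return [i for i, token in enumerate(ocr_tokens)
            if dice(query_term, token) >= threshold]


# Hypothetical OCR output containing typical character confusions
# ("m" read as "rn", "l" read as "1").
tokens = ["docurnent", "retrieva1", "systems"]
print(highlight("retrieval", tokens))
print(highlight("document", tokens))
```

Because the match is approximate, a highlighted token is a candidate occurrence rather than a confirmed one, which is the source of the errors noted above.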



4 Cross-Language Multimedia Retrieval 

There has so far been very little work in the area of Cross-Language Multimedia 
Information Retrieval (CLMIR). This is a potentially important future research 
topic as the growth of multilingual and multimedia document collections is likely 
to lead inevitably to the growth of multilingual multimedia collections. This 
section briefly reviews existing work in CLMIR and suggests some directions for 
further research. 

There are few examples of published work in Cross-Language Speech Re- 
trieval (CLSR). A study carried out at ETH Zurich used French language text 
requests to retrieve spoken German news documents. The requests were 
translated using a similarity thesaurus constructed using a parallel collection 
of French and German news stories. A more recent study explores 
the retrieval of English voice-mail messages from the VMR1 collection with 
French text requests using request translation performed with a dictionary-based 
method and a standard MT system. Results from these investigations suggest 
that standard CLIR techniques, such as pseudo relevance feedback, are effec- 
tive for CLSR, and that retrieval performance degradations arising from CLIR 
and SDR are additive. However, these are both small scale studies and their 
conclusions need to be verified on much larger collections. A review of existing 
technologies applicable to CLSR is also available in the literature. 

It is not clear how IA should be approached for CLSR tasks. When the 
example VMR1b message shown in Fig. 1 is applied to the Power Translator 
Pro MT System the French translation shown in Fig. 3 is produced. For a user 
with a moderate level of knowledge of the French language it can be seen that this 
translation is generally fairly impressive, clearly indicating the content of the 
message. Assuming that this translation was not available, a French speaker 
M524p003: [tongue_click] O K, [pause] j’ai arrange un temps et une 
date que tout le monde sur le projet peut faire finalement. [loud_breath] 
[um] [tongue_click] Le temps est dix heures et la date est mercredi 
les vingt cinquieme. [pause] [ah] lundi et mardi etaient dehors 
malheureusement , [loud_breath] [ah] si n’importe qui ne peut pas faire 
ce temps et la date peut s’il vous plait ils m’ont laisse savoir des 
que possible [pause] et j’arrangerai [pause] la reunion pour un autre 
[loud_breath] temps mutuellement acceptable . [loud_breath] La chose 
principale que nous allons discuter est la date limite prochaine. 

Fig. 3. Example of manual VMR message transcription with transcriber inserted 
punctuation translated into French using Power Translator Pro. 

M524p003: LE K. R. O. FINALLY A ARRANGE UN TEMPS ET UNE LUMIERE DU 
PRODUIT DU RIVAL ET N’APPRENDRA PAS CE TEMPS QUE CE TYPE DE CANDIDAT 
EST MERCREDI LES VINGT CINQUIEME PENDANT QUE DIRECTEUR DE LA ROUTE ET 
POUR LA VOITURE TOTALE COMME CE TEMPS DANS CONTROLE DIRECT ET DES QUE 
POSSIBLE ET GAMME D’UNE REUNION POUR UN SUPPORT NUCLEAIRE SUR LA CHOSE 
PRINCIPALE NOUS NE DISCUTONS JAMAIS LES NOUVELLES QUI GENRE DE LIGNE 

Fig. 4. Example of 20K Large Vocabulary Speech Recogniser VMR message 
transcription translated into French using Power Translator Pro. 



without any knowledge of English may find the graphical visualisation and gloss 
translation strategies useful in making relevance judgements for this document. 
However, they might experience considerable difficulty in extracting information 
from a relevant document transcription. This latter problem would, of course, be 
much more significant for most users if the document were originally in Chinese. 

Figure 4 shows the output of applying the example LVR transcription shown 
in Fig. 2 to the Power Translator Pro translation system. Once again the trans- 
lation is a respectable version of the English data input. The primary problem 
here though is with the information contained in the transcribed English input 
to the MT system. Thus a key fundamental unresolved research issue is how this 
incorrectly transcribed information can be accessed across languages by non- 
specialists in the document language. The user can listen to the soundtrack, or 
perhaps seek the assistance of a professional translation service, but this does not 
provide a solution to the problem of rapid automated CLIA for users unfamiliar 
with the document language. 

So far there have not been any reported research results in cross-language 
retrieval from document image collections. A review of the technologies and 
possible approaches to this has been published, but research results are not reported. 
One problem for cross-language document image retrieval relates to the trans- 
lation of the output of OCR either for retrieval or content access. A feature of 
OCR systems is that they make errors in the recognition of individual characters 
within a word. These errors can sometimes be corrected in post processing, but 
often they cannot. These recognised “words” are not present in standard dic- 
tionaries and thus cannot be translated directly, either by an MT system or by 
simple dictionary lookup. A method of approximate matching with dictionary 
entries, perhaps involving steps such as part-of-speech matching and word co- 
occurrence analysis, might prove effective, but there will remain the possibility 
of translation errors which result from incorrect word recognition. 
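The idea of approximate matching against dictionary entries can be sketched with a plain edit-distance lookup. This is a minimal illustration under stated assumptions: the three-entry English-French lexicon, the function names and the distance limit of 2 are hypothetical, and the part-of-speech and co-occurrence steps suggested above are deliberately left out.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row of costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


def translate_ocr_word(word, bilingual, max_dist=2):
    """Return (headword, translation) for the closest dictionary entry.

    Returns None when no headword lies within max_dist edits, so that an
    unrecognisable OCR string is left untranslated rather than guessed.
    """
    if word in bilingual:
        return word, bilingual[word]
    best = min(bilingual, key=lambda head: edit_distance(word, head))
    if edit_distance(word, best) <= max_dist:
        return best, bilingual[best]
    return None


# Hypothetical toy lexicon, for illustration only.
lexicon = {"meeting": "réunion", "deadline": "date limite", "project": "projet"}
print(translate_ocr_word("rneeting", lexicon))
print(translate_ocr_word("xyzzy", lexicon))
```

A nearest-entry lookup of this kind can still select the wrong headword when two dictionary words are equally close to the corrupted string, which is exactly the residual translation error discussed above.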

These translation problems will impact on the accuracy of translations pre- 
sented to the user for relevance assessment and information extraction. Problems 
similar to those illustrated for SDR in Fig. 4 may result, but the extent of this 
problem needs to be explored experimentally. 

In conclusion, in order to advance research in CLMIR there is a need for stan- 
dard test collections, either based on existing monolingual multimedia retrieval 
collections or developed specifically to support research in CLMIR. 



5 Concluding Remarks 

This paper has suggested some new research tasks in CLIR designed to ad- 
dress some important challenges of cross-language access to linguistic informa- 
tion. Specifically it has looked at topics in assessing the relevance of retrieved 
documents and extracting information from individual documents, and current 
research in multimedia retrieval and its extension to a cross-language environ- 
ment. 
Abstract. Improvement in cross-language information retrieval results 
can come from a variety of sources - failure analysis, resource enrichment 
in terms of stemming and parallel and comparable corpora, use of pivot 
languages, as well as phonetic transliteration and Romanization. Appli- 
cation of these methodologies should contribute to a gradual increase in 
the ability of search software to cross the language barrier. 



1 Failure Analysis 

In my opinion there has been a dearth of detailed failure analysis in cross- 
language information retrieval, even among the best-performing methods in com- 
parative evaluations at TREC, CLEF, and NTCIR [7]. Just as a post-mortem can 
determine causation in mortality, a query-by-query analysis can often shed light 
on why some approaches succeed and others fail. Among the sets of queries 
utilized in these evaluations we always find several queries on which all partici- 
pants perform poorly (as measured by the median precision over all runs for 
that query). When the best performance is significantly better than the median 
it would be instructive to determine why that method succeeded while others 
failed. If the best performance is not much better than the median, then some- 
thing inherently difficult in the topic description presents a research challenge to 
the CLIR community. Two examples are illustrative, one from TREC and the 
other from CLEF. 

The TREC-7 conference was the first multilingual evaluation where a particu- 
lar topic language was to be run against multiple language document collections. 
The collection languages were the same as in CLEF (English, French, German, 
Italian). Topic 36, whose English title is “Art Thefts”, has the French trans- 
lated equivalent “Les voleurs d’art”. The Altavista Babelfish translation of the 
French results in the phrase “The robbers of art”, which grasps the significance, 
if not the additional precision of the original English. However, when combined 
with aggressive stemming, the meaning can be quite different. The Berkeley 
French→Multilingual run first stemmed the word ‘voleurs’ to the stem ‘vol’, and 
the translation of this stem to English is ‘flight’ and to German ‘flug’, signifi- 
cantly different from the original unstemmed translation. In fact our F→EFGI 
performance for this query was 0.0799 precision versus our E→EFGI precision 
of 0.3830. 
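The interaction between stemming and translation in this example can be reproduced with a toy pipeline. The sketch below is not the Berkeley system: the two stemming rules and the three-entry lexicon are assumptions made purely to show how stemming before dictionary lookup can change which entry is retrieved.

```python
# Hypothetical French-English lexicon; only the senses needed for the example.
LEXICON = {"voleur": "robber", "vol": "flight", "art": "art"}


def light_stem(word):
    """Strip only a plural -s (toy rule)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def aggressive_stem(word):
    """Strip the plural and then the agent suffix -eur (toy rule)."""
    word = light_stem(word)
    return word[:-3] if word.endswith("eur") and len(word) > 5 else word


def translate(words, stem):
    """Stem each word, then look the stem up; unknown stems pass through."""
    return [LEXICON.get(stem(w), stem(w)) for w in words]


query = ["voleurs", "art"]
print(translate(query, light_stem))
print(translate(query, aggressive_stem))
```

With the light stemmer the lookup reaches the entry for ‘voleur’, while the aggressive stemmer reduces the word to ‘vol’ and the lookup returns a different sense.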

For the CLEF evaluation, one query provides a significant example of the 
challenges facing CLIR, even with a single language such as English. Query 40 
about the privatization of the German national railway was one which seems to 
have presented problems for all participating groups (the median precision over 
all CLEF multilingual runs was 0.0537 for this query). As an American group, the 
Berkeley group was challenged by the use of the English spelling ‘privatisation’, 
which could not be recognized by any of the machine translation software. The 
German version of the topic was not much better: in translation its English equivalent 
became ‘de-nationalization’, a very uncommon synonym for ‘privatization’, and 
one which yielded few relevant documents. By comparison, our German manual 
reformulation of this query resulted in an average precision of 0.3749, the best 
CLEF performance for this query. 

These examples illustrate that careful post-evaluation analysis might provide 
the feedback which can be incorporated into design changes and improved system 
performance. 
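The query-by-query comparison of best and median precision described in this section can be sketched as a small script. Everything in it is an assumption made for illustration: the run names, the precision values, and the two thresholds (`gap` and `hard`) are hypothetical and do not come from any TREC or CLEF result file.

```python
from statistics import median


def flag_queries(precision_by_run, gap=0.2, hard=0.1):
    """Classify queries for failure analysis.

    precision_by_run maps run name -> {query id -> average precision}.
    A query is flagged "study_best_run" when the best run exceeds the median
    by at least `gap`, and "hard_topic" when the median is at most `hard`
    and no run stands out. Both thresholds are illustrative assumptions.
    """
    queries = sorted({q for scores in precision_by_run.values() for q in scores})
    report = {}
    for query in queries:
        scores = [s[query] for s in precision_by_run.values() if query in s]
        best, med = max(scores), median(scores)
        if best - med >= gap:
            report[query] = "study_best_run"
        elif med <= hard:
            report[query] = "hard_topic"
    return report


# Hypothetical per-query average precision for three runs.
runs = {
    "runA": {"Q1": 0.40, "Q2": 0.06, "Q3": 0.50},
    "runB": {"Q1": 0.10, "Q2": 0.05, "Q3": 0.45},
    "runC": {"Q1": 0.05, "Q2": 0.03, "Q3": 0.40},
}
print(flag_queries(runs))
```

Queries in the first class point to a method worth examining in detail; queries in the second class point to a topic that is difficult for every approach.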

2 Resource Enrichment 

2.1 Stemmers and Morphology 

The CLEF evaluation seems to be the first one in which significant experiments 
in multiple language stemming and morphology were carried out. Some groups devel- 
oped “poor man’s” stemmers by taking the corpus word lists and developing stem 
classes based upon common prefix strings. The Chicago group applied their auto- 
matic morphological analyzer to the CLEF collections to generate a custom stem- 
mer for each language’s collection [5], while the Maryland group extended the 
Chicago approach by developing a four-stage statistical stemming approach [14]. 
The availability of the Porter stemmers in French, German and Italian (from 
http://open.muscat.com/) also heavily influenced CLEF entries. The conclusion 
seems to be that stemming plays an important role in performance improvement 
for non-English European languages, with improvements substantially larger than 
those obtained from English stemming. 
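A stemmer built from common prefix strings in a corpus word list can be sketched in a few lines. This is only one possible reading of the approach: the fixed prefix length and the function name are assumptions, and the participating groups may have chosen their stem classes quite differently.

```python
from collections import defaultdict


def prefix_stem_classes(vocabulary, prefix_len=4):
    """Group corpus words into stem classes by a shared leading string.

    Words shorter than prefix_len form their own class. The fixed prefix
    length is an illustrative assumption.
    """
    classes = defaultdict(set)
    for word in vocabulary:
        w = word.lower()
        classes[w[:prefix_len]].add(w)
    return dict(classes)


# Hypothetical fragment of a German corpus word list.
vocabulary = ["Privatisierung", "privatisieren", "privat", "Bahn", "Bahnen"]
for stem, members in sorted(prefix_stem_classes(vocabulary).items()):
    print(stem, sorted(members))
```

The trade-off is visible even in this toy list: a short prefix conflates more inflected forms but also risks merging unrelated words, while a long prefix keeps classes pure at the cost of splitting true variants.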



2.2 Parallel Corpora and Web Mining 

Parallel corpora have been recognized as a major resource for CLIR. Several 
entries in CLEF, in particular the Johns Hopkins APL [11], used aligned parallel 
corpora in French and English from the Linguistic Data Consortium. More re- 
cently emphasis has been given to mining the WWW for parallel resources. 
There are many sites, particularly in Europe, which have versions of the same 
web page in different languages. Tools have been built which extract parallel 
bilingual corpora from the web [13, 16]. These were applied in CLEF by the 
Montreal group [12] and the Twente/TNO group [6]. 
2.3 Comparable Corpora Alignment 

Comparable corpora are bilingual corpora which can be created through align- 
ment of similar documents on the same topic in different languages. An example 
might be the foreign edition of a newspaper where stories about the same news 
item are written independently. Techniques for alignment require relaxation of 
time position (a story might appear a few days later) and the establishment of 
the contextual environment of topic. There has been research into the statistical 
alignment of comparable corpora by Picchi and Peters with Italian and English 
[15] and Fung with English and Chinese [2], but the techniques have not made 
their way into general practice. Comparable corpora will only become widely 
used if tools for their acquisition are created as open-source software and tools 
for their alignment are refined and also made available. 



2.4 Geographic and Proper Names 

A major need is to provide geographic and proper name recognition across lan- 
guages. Proper names are often not in either machine translation programs or 
bilingual dictionaries, nor are geographic place names. A particular case in point 
was the TREC-6 cross language query CL1 about Austrian President Kurt Wald- 
heim’s connection with Nazism during WW II: one translation system trans- 
lated from the German ‘Waldheim’ to English ‘forest home’. 

It has been suggested that more than thirty percent of content bearing 
words from news services are proper nouns, either personal and business enter- 
prise names or geographic place name references. The availability of electronic 
gazetteers such as: 

— National Imagery and Mapping Agency’s country name files: 
http://164.214.2.59/gns/html/Cntry_Files.html 

— Census Bureau’s gazetteer for United States: 
http://tiger.census.gov/ 

— Arizona State University’s list of place name servers: 
http://www.asu.edu/lib/hayden/govdocs/maps/geogname.htm 

— Global Gazetteer of 2880532 cities and towns around the world: 
http://www.calle.com/world/ 

gives some hope that geographic name recognition could be built into future CLIR 
systems. 

While work has been done on extracting proper nouns in English and some 
other languages through the Message Understanding Conference series, it is not 
clear that anyone has mined parallel texts to create specialized bilingual lexicons 
of proper names. 






Fredric C. Gey 



3 Pivot Languages 

In multilingual retrieval between queries and documents in n languages, one 
seems to be required to possess resources (machine translation, bilingual dic- 
tionaries, parallel corpora, etc.) between each pair of languages. Thus $O(n^2)$ 
resources are needed. This can be approximated with the substitution of transi- 
tivity among $O(n)$ resources if a general purpose pivot language is used. Thus to 
transfer a query from German to Italian, where machine translation is avail- 
able from German to English and English to Italian respectively, the query 
is translated into English and subsequently into Italian, and English becomes 
the pivot language. This method was used by the Berkeley group in TREC-7 
[3] and CLEF [4]. The Twente/TNO group has utilized Dutch as a pivot lan- 
guage between pairs of languages where direct resources were unavailable in both 
TREC-8 and CLEF [6, 10]. One can easily imagine that excellent transitive ma- 
chine translation could provide better results than poor direct resources such 
as a limited bilingual dictionary. In some cases resources may not even exist 
for one language pair; this will become increasingly common with the in- 
crease in the number of languages for which cross-language information search 
is desired. For example, a CLIR researcher may be unable to find an electronic 
dictionary resource between English and Malagasy (the language of Madagas- 
car), but there are French newspapers in this former colony of France where 
French is still an official language. Thus, an electronic French-Malagasy dic- 
tionary may be more complete and easier to locate than an English-Malagasy 
one. Similarly the Russian language may provide key resources to transfer words 
from the Pashto (Afghan), Farsi, Tajik, and Uzbek languages (see, for example, 
http://members.tripod.com/Groznijat/b_lang/bl_sourc.html). 
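The resource argument and the two-step translation can be made concrete with a short sketch. The resource counts assume one directed resource per ordered language pair, and the two small lexicons are hypothetical stand-ins for real machine translation systems; neither reflects the actual Berkeley or Twente/TNO implementations.

```python
def direct_resources(n):
    """Directed translation resources needed for every ordered pair of n languages."""
    return n * (n - 1)


def pivot_resources(n):
    """Resources needed when every language translates only to and from one pivot."""
    return 2 * (n - 1)


def translate(words, lexicon):
    """Word-by-word lookup; words missing from the lexicon pass through unchanged."""
    return [lexicon.get(w, w) for w in words]


def pivot_translate(words, source_to_pivot, pivot_to_target):
    """Translate via the pivot language by composing two lookups."""
    return translate(translate(words, source_to_pivot), pivot_to_target)


# Hypothetical toy lexicons: German -> English (pivot) -> Italian.
de_en = {"privatisierung": "privatization", "eisenbahn": "railway"}
en_it = {"privatization": "privatizzazione", "railway": "ferrovia"}

print(direct_resources(4), pivot_resources(4))
print(pivot_translate(["privatisierung", "eisenbahn"], de_en, en_it))
```

Any word lost or mistranslated in the first step cannot be recovered in the second, so the quality of the route through the pivot is bounded by its weaker link.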

4 Phonetic Transliteration and Romanization 

One of the most important and neglected areas in cross-language information 
retrieval is, in my opinion, the application of transliteration to the retrieval pro- 
cess. The idea of transliteration in CLIR derives from the suggestion by Buck- 
ley in the TREC-6 conference that for English-French CLIR “English query 
words are treated as potentially mis-spelled French words” [1]. In this way En- 
glish query words can be replaced by French words which are lexicographically 
similar and the query can proceed monolingually. More generally, we can 
often find that many words, particularly in technology areas, have been borrowed 
phonetically from English and are pronounced similarly, yet with phonetic cus- 
tomization in the borrower language. The problem of automatic recognition of 
phonetic transliteration has been studied by Knight and Graehl for the Japanese 
katakana alphabet [9] and by Stalls and Knight for Arabic [17]. Another kind of 
transliteration is Romanization, wherein an unfamiliar script, such as Cyrillic, is 
replaced by its Roman alphabet equivalent. When done by library catalogers, the 
transformation is one-to-one, i.e. the original script can be recovered by reverse 
transformation. This is not the case for phonetic transliteration where more than 
one sound in the source language can project to a single representation in the 
target language. The figure below comes from the entry for ‘economic policy’ 
in the GIRT special domain retrieval thesaurus of CLEF [8]. The GIRT cre- 
ators have provided a translation of the thesaurus into Russian which our group 


[Figure: XML entry from the GIRT thesaurus for ‘economic policy’, showing the 
German term, its Russian translation, and the Roman transliteration of the Russian.] 

Fig. 1. German-Russian GIRT Thesaurus with Transliteration 

has transliterated into its Roman equivalent using the U.S. Library of Congress 
specification (see http://lcweb.loc.gov/rr/european/lccyr.html). It is clear that 
either a fuzzy string or phonetic search with English words ‘economic’, ‘policy’, 
or ‘politics’ would retrieve this entry from the thesaurus or from a collection 
of Russian documents. Generalized string searches of this type have yet to be 
incorporated into information retrieval systems. 
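A fuzzy string search over a Romanized thesaurus entry can be sketched with the Python standard library. The mapping table below is a simplified, lossy subset written for this example and omits the diacritics and ligature marks of the full Library of Congress scheme; the prefix comparison and the 0.6 threshold are likewise assumptions, not the method used by our group.

```python
import difflib

# Simplified Cyrillic-to-Roman table (assumption: diacritics are dropped).
ROMAN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch", "ъ": "",
    "ы": "y", "ь": "", "э": "e", "ю": "iu", "я": "ia",
}


def romanize(text):
    """Replace each Cyrillic letter by its Roman equivalent from the table."""
    return "".join(ROMAN.get(ch, ch) for ch in text.lower())


def fuzzy_lookup(query, entries, threshold=0.6):
    """Return (entry, matched word, score) for Romanized words similar to the query.

    Each Romanized word is truncated to the query length before comparison
    so that inflectional endings do not lower the score.
    """
    hits = []
    for entry in entries:
        for word in romanize(entry).split():
            score = difflib.SequenceMatcher(
                None, query.lower(), word[:len(query)]).ratio()
            if score >= threshold:
                hits.append((entry, word, round(score, 2)))
    return hits


thesaurus = ["экономическая политика"]
for english in ("economic", "policy", "politics"):
    print(english, fuzzy_lookup(english, thesaurus))
```

Each of the three English words matches one Romanized word of the entry and not the other, which is the behaviour the paragraph above anticipates for a generalized string search.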

5 Summary and Acknowledgments 

This paper has presented a personal view of what developments are needed 
to improve cross-language information retrieval performance. Two of the most 
exciting advances in cross-language information retrieval are mining the web 
for parallel corpora to build bi-lingual lexicons and the application of phonetic 
transliteration toward search in the absence of translation resources. Comparable 
corpora development, which has perhaps the greatest potential to advance the 
field, has yet to achieve its promise in terms of impact, probably because of the 
lack of generally available processing tools. 

I wish to thank Hailing Jiang and Aitao Chen for their support in running 
a number of experiments and Natalia Perelman for implementing the Russian 
transliteration of the GIRT thesaurus. Major funding was provided by DARPA 
(Department of Defense Advanced Research Projects Agency) under research 
grant N66001-00-1-8911, Mar 2000-Feb 2003 as part of the DARPA Translingual 
Information Detection, Extraction, and Summarization Program (TIDES). 



References 

[1] C. Buckley, J. Walz, M. Mitra, and C. Cardie. Using clustering and superconcepts 
within SMART: TREC-6. In E.M. Voorhees and D. K. Harman, editors, The Sixth 
Text REtrieval Conference (TREC-6), NIST Special Publication 500-240, pages 
107-124, August 1998. 

[2] Pascale Fung. A statistical view on bilingual lexicon extraction: From parallel 
corpora to non-parallel corpora. In D. Farwell, L. Gerber, and E. Hovy, editors, 
Proceedings of AMTA-98 Conference, Machine Translation and the Information Soup, 
Pennsylvania, USA, October 28-31, 1998, pages 1-16. Springer-Verlag, 1998. 

[3] F. C. Gey, H. Jiang, and A. Chen. Manual queries and machine translation in cross- 
language retrieval at TREC-7. In E.M. Voorhees and D. K. Harman, editors, The 
Seventh Text REtrieval Conference (TREC-7), NIST Special Publication 500-242, 
pages 527-540. National Institute of Standards and Technology, July 1999. 

[4] Fredric Gey, Hailing Jiang, Vivien Petras, and Aitao Chen. Cross-language re- 
trieval for the clef collections - comparing multiple methods of retrieval. In this 
volume. Springer, 2000. 

[5] John Goldsmith, Darrick Higgins, and Svetlana Soglasnova. Automatic language- 
specific stemming in information retrieval. In this volume. Springer, 2000. 

[6] Djoerd Hiemstra, Wessel Kraaij, Renee Pohlmann, and Thijs Westerveld. Trans- 
lation resources, merging strategies, and relevance feedback for cross-language 
information retrieval. In this volume. Springer, 2000. 

[7] Noriko Kando and Toshihiko Nozue, editors. Proceedings of the First NTCIR 
Workshop on Japanese Text Retrieval and Term Recognition. NACSIS (now Na- 
tional Institute of Informatics), Tokyo, 1999. 

[8] Michael Kluck and Fredric Gey. The domain-specific task of clef - structure 
and opportunities for specific evaluation strategies in cross-language information 
retrieval. In this volume. Springer, 2000. 

[9] K. Knight and J. Graehl. Machine transliteration. In Computational Linguistics, 
24(4), 1998. 

[10] Wessel Kraaij, Renee Pohlmann, and Djoerd Hiemstra. Twenty-one at trec-8: Us- 
ing language technology for information retrieval. In Ellen Voorhees and D Har- 
man, editors. Working Notes of the Eighth Text REtrieval Conference (TREC-8), 
pages 203-217, November 1999. 

[11] Paul McNamee, James Mayfield, and Christine Piatko. A language-independent 
approach to european text retrieval. In this volume. Springer, 2000. 

[12] Jian-Yun Nie, Michel Simard, and George Foster. Multilingual information re- 
trieval based on parallel texts from the web. In this volume. Springer, 2000. 

[13] Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. Cross- 
language information retrieval based on parallel texts and automatic mining of 
parallel texts from the web. In SIGIR ’99: Proceedings of the 22nd Annual Inter- 
national ACM SIGIR Conference on Research and Development in Information 
Retrieval, August 15-19, 1999, Berkeley, CA, USA, pages 74-81. ACM, 1999. 

[14] Douglas Oard, Gina-Anne Levow, and Clara Cabezas. CLEF experiments at the 
university of maryland: Statistical stemming and backoff translation strategies. 
In this volume. Springer, 2000. 

[15] Eugenio Picchi and Carol Peters. Cross language information retrieval: A system 
for comparable corpus querying. In Gregory Grefenstette, editor, Cross Language 
Information Retrieval, pages 81-91. Kluwer, 1998. 

[16] Philip Resnik. Mining the web for bilingual text. In Proceedings of 37th Annual 
Meeting of the Association for Computational Linguistics (ACL’99), College Park, 
Maryland, June 1999, 1999. 

[17] B. Stalls and K. Knight. Translating names and technical terms in arabic text. In 
Proc of the COLING/ACL Workshop on Computational Approaches to Semitic 
Languages, 1998, 1998. 




CLEF 2000 - Overview of Results 



Martin Braschler 



Eurospider Information Technology AG 
Schaffhauserstr. 18 
8006 Zurich, Switzerland 
braschler@eurospider.ch 



Abstract. The first CLEF campaign was a big success in attracting increased 
participation when compared to its predecessor, the TREC8 cross-language 
track. Both the number of participants and the number of experiments have grown 
considerably. This paper presents details of the various subtasks, and attempts 
to summarize the main results and research directions that were observed. 
Additionally, the CLEF collection is examined with respect to the completeness 
of its relevance assessments. The analysis indicates that the CLEF relevance 
assessments are of comparable quality to those of the well-known and trusted 
TREC ad-hoc collections. 



1 Introduction 

CLEF 2000 has brought a substantial increase in the number of participating groups 
compared to its predecessor, the TREC8 cross-language (CLIR) track [1]. This means 
that the number and diversity of experiments that were submitted has also increased. 
The following report tries to summarize the main results and main research directions 
that were observed during the first CLEF campaign. 

Multilingual retrieval was the biggest subtask in CLEF, and also received the most 
attention. Therefore, it will be the main focus of this paper. That the majority of 
participants tried to tackle this subtask is an encouraging sign. It is evidence that these 
groups try to adapt their systems to a multitude of languages, instead of focusing on a 
few obvious pairs. However, the smaller subtasks of bilingual and monolingual 
retrieval served important purposes as well, both in terms of helping to better 
understand the characteristics of individual languages, as well as in attracting new 
groups that have not previously participated in the TREC CLIR track or any other 
TREC track. 

In the following, details with respect to the number of runs for the subtasks and 
different languages are given. The discussion continues with a summary of some 
defining characteristics of individual experiments by the participants, and 
comparisons of the results that were obtained. Finally, the resulting CLEF test 
collection is investigated for the completeness of its relevance assessments. 
2 Subtasks 



In total, 20 groups from 10 different countries participated in one or more of the 
subtasks that were offered for CLEF 2000 (see Table 1). Of these, 16 did some form 
of cross-language experiments (either multilingual, bilingual or both), while the 
remaining 4 concentrated exclusively on monolingual retrieval. Three groups worked 
on the GIRT domain-specific subtask. Nine groups participated in more than one 
subtask, but no group tried all four. 



Table 1. List of participants 



CWI (Netherlands)                     Univ. Dortmund (Germany) 
Eurospider (Switzerland)              Univ. Glasgow (UK) 
IAI (Germany)                         Univ. Maryland (USA) 
IRIT (France)                         Univ. Montreal/RALI (Canada) 
ITC-irst (Italy)                      Univ. Salamanca (Spain) 
Johns Hopkins Univ./APL (USA)         Univ. Sheffield (UK) 
New Mexico State Univ. (USA)          Univ. Tampere (Finland) 
Syracuse Univ. (USA)                  Univ. of California at Berkeley (USA) 
TNO/Univ. Twente (Netherlands)        West Group (USA) 
Univ. Chicago (USA)                   Xerox XRCE (France) 



Table 2 compares the number of participants and experiments to those of earlier 
TREC CLIR tracks. 

Please note that in TREC6, only bilingual retrieval was offered, which resulted in a 
large number of runs combining different pairs of languages [10]. Starting with 
TREC7, multilingual runs were introduced, which usually consist of multiple runs for 
the individual languages that are later merged. The number of experiments for TREC6 
is therefore not directly comparable to later years. 



Table 2. Development in the number of participants and experiments 



Year      # Participants    # Experiments
TREC6     13                (95)
TREC7     9                 27
TREC8     12                45
CLEF      20                95



CLEF was clearly a breakthrough in promoting larger participation. While the 
number of participants stayed more or less constant in the three years that the CLIR 
track was part of TREC, this number nearly doubled for the first year that CLEF was 
a stand-alone activity. 

A total of 95 individual experiments were submitted, also a substantial increase 
over the number in the TREC8 CLIR track. A breakdown into the individual subtasks 
can be found in Table 3. 
Table 3. Experiments listed by subtask



Subtask                 # Participants    # Runs
Multilingual            11                28
Bilingual               10                27
Monolingual French      9                 10
Monolingual German      11                13
Monolingual Italian     9                 10
Domain-specific GIRT    3                 7



All topic languages were used for experiments, including the translations of the 
topics into Dutch, Finnish, Spanish and Swedish, which were provided by 
independent third parties. German and English were the most popular topic languages, 
with German being used slightly more than English. However, this is partly due to the 
fact that English was not an eligible topic language for the bilingual and monolingual 
subtasks. Table 4 shows a summary of the topic languages and their use. 



Table 4. Experiments listed by topic language 





Language    # Runs
English     26
French      17
German      29
Italian     11
Others      13



A large majority of runs (80 out of 95) used the complete topics, including all
fields. Since it is generally agreed that such lengthy expressions of information
needs do not reflect the realities of some applications, such as web searching, it
would probably be beneficial if the number of experiments using shorter queries
increased in coming years. Similarly, the number of manual experiments was low (6
out of 95). Manual experiments are useful in establishing baselines and in improving
the overall quality of relevance assessment pools. An increase in the
number of these experiments would therefore be welcome, especially since they also tend to
focus on interesting aspects of the retrieval process that are not usually covered by
batch evaluations.



3 Characteristics of Experiments 

Table 5 shows a summary of the use of some core elements of multilingual 
information retrieval in the participants' systems. Most groups that experimented with 
cross-language retrieval concentrated on query translation, although two groups,
University of Maryland and Eurospider, also tried document translation.

There is more variation in the type of translation resources that were employed. A 
majority of systems used some form of a dictionary for at least one language 
combination. There is also a sizeable number of participants that experimented with 
translation resources that were constructed automatically from corpora. Lastly, some
groups either used commercial machine translation (MT) systems or manual query
reformulations. Many groups combined more than one of these types of translation
resources, either by using different types for different languages or by using more than
one type for individual language pairs.

Table 5. Some main characteristics of experiments by individual participants 




Table 5 marks, per participant, which of the following elements were used (the
per-participant columns of the original table are not reproduced here):

Trans. Approach:    Query translation, Document translation, No translation
Trans. Resources:   Dictionary, Corpus-based, MT, Manual
Ling. Processing:   Decompounding



Considerable effort was invested this year in stemming and decompounding issues. 
This may be partly due to increased participation by European groups, which 
exploited their intimate knowledge of the languages in CLEF. Nearly all groups used 
some form of stemming in their experiments. Some of these stemming methods were 
elaborate, with detailed morphological analysis and part-of-speech annotation. On the 
other hand, some approaches were geared specifically towards simplicity or language- 
independence, with multiple groups relying on statistical approaches to the problem. 
The German decompounding issue was also addressed by several groups, using 
methods of varying complexity. 

Some additional noteworthy characteristics include: 

• A new method for re-estimating translation probabilities during blind relevance 
feedback by the TNO/University of Twente group [5]. 

• Extensive GIRT experiments, including the use of the GIRT thesaurus, by the 
University of California at Berkeley [3]. 

• The use of 6-grams as an alternative to stemming/decompounding by Johns 
Hopkins University/APL [7]. 

• The use of lexical triangulation, a method to improve the quality of translations 
involving an intermediary pivot language, by the University of Sheffield [4]. 
• Mining the web for parallel texts, which can then be used in corpus-based
approaches. This was used by University of Montreal/RALI [9] and the
TNO/University of Twente group, as well as Johns Hopkins University/APL.

• The combination of both document translation and query translation, by Eurospider
[2].

• Interactive experiments by New Mexico State University. 

For a detailed discussion of these and more characteristics, please refer to the 
individual participants' papers in this volume. 



4 Results 

4.1 Multilingual 

Eleven groups submitted results for the multilingual subtask. Since for many of these 
groups this subtask was the main focus of their work, they sent multiple different 
result sets. Figure 1 shows the best experiments of the five top groups in the 
automatic category for this subtask. 

It is interesting to note that all five top groups are previous TREC participants,
with one of them going all the way back to TREC1 (Berkeley). These groups
outperformed newcomers substantially. This may be an indication that the "veteran"
groups benefited from the experience they gained in previous years, whereas the new
groups still experienced some "growing pains". It will be interesting to see if the
newcomers catch up next year. The two top performing entries both used a
combination of translations from multiple sources. The entry from Johns Hopkins
University achieved good performance despite avoiding the use of language-specific
resources.



4.2 Bilingual 

The best results for the bilingual subtask come from groups that also participated in 
the multilingual subtask (see Figure 2). Additionally, University of Tampere and CWI 
also submitted entries that performed well. Both these entries use compound-splitting 
for the source language (German and Dutch, respectively), which likely helped to get 
a better coverage in their dictionary-based approaches. 
CLEF 2000 Multilingual Task - Automatic




Fig. 1. The best entries of the top five performing groups for the multilingual subtask. 



CLEF 2000 Bilingual Task - Automatic 




Fig. 2. The best entries of the top five performing groups for the bilingual subtask. 



4.3 Monolingual 

Some of the best performing entries in the monolingual subtask came from groups 
that did not conduct cross-language experiments and instead concentrated on 
monolingual retrieval. Two such groups are West Group and ITC-irst, which 
produced the top-performing French and Italian entries, respectively (see Figures 3 and
4). Both groups used elaborate morphological analysis in order to obtain base forms
of query words and document terms. However, the performance of the top groups in 
French and Italian monolingual retrieval is in general very comparable. 



CLEF 2000 Monolingual Task - French Automatic




Fig. 3. The best entries of the top five performing groups for the French monolingual subtask. 



CLEF 2000 Monolingual Task - Italian Automatic




Fig. 4. The best entries of the top five performing groups for the Italian monolingual subtask. 



In contrast, the differences for German monolingual are substantially larger (see 
Figure 5). The best run by the top performing group outperforms the best entry by the 
fifth-placed group by 37% for German, whereas for French and Italian the difference 
is only 20% and 13%, respectively. One likely explanation for the larger spread is the
decompounding issue: the four best performing groups all addressed this peculiarity
of the German language either by splitting the compounds (Eurospider, TNO, West
Group) or through the use of n-grams (Johns Hopkins). The results by West
Group in particular seem to support the notion that decompounding was crucial to obtaining good
performance in this subtask [8]. They report that stemming without decompounding
gave practically no improvement in performance, whereas they gained more than 25%
in average precision when adding decompounding.
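
To make the n-gram alternative concrete, the following minimal sketch shows how overlapping character n-grams let a query word match inside a German compound without any explicit splitting. The function name, the choice of n = 6 and the example words are illustrative assumptions; this is not a description of any participant's system.

```python
def char_ngrams(text, n=6):
    """Return the set of overlapping character n-grams of each word in text."""
    grams = set()
    for word in text.lower().split():
        if len(word) <= n:
            grams.add(word)  # short words are kept whole
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# Hypothetical example: a compound and one of its constituents.
compound = char_ngrams("Umweltverschmutzung")  # "environmental pollution"
query = char_ngrams("Verschmutzung")           # "pollution"
shared = compound & query                      # n-grams that would match at retrieval time
```

Every 6-gram of the constituent also occurs in the compound, so an index built on such n-grams can match the two words even though neither stemming nor compound splitting was applied.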



CLEF 2000 Monolingual Task - German Automatic 




Fig. 5. The best entries of the top five performing groups for the German monolingual subtask. 



4.4 GIRT 

Continuing the practice of the TREC8 cross-language track, a subtask dealing with 
domain-specific data was offered to CLEF participants. The data collection was an 
extended set of the German "GIRT" texts previously used in TREC-CLIR. The texts 
come from the domain of social science, and are written in German. Approximately
three quarters (71%) of the texts have English titles, and around 8% have English
abstracts. The texts also have controlled thesaurus terms assigned to them and the 
corresponding thesaurus was distributed to participants in German/English and 
German/Russian bilingual form. No group used the Russian version for official CLEF 
experiments. The main objective of the GIRT subtask is to investigate the use of this 
thesaurus, as well as the use of the English titles and abstracts, for monolingual and 
cross-language information retrieval (see also [6]). 

Three groups submitted a total of seven runs. Xerox focused on monolingual 
experiments, whereas University of California at Berkeley investigated only cross- 
language retrieval on this collection. University of Dortmund submitted results from 
both monolingual and cross-language experiments. 
While the Dortmund group used machine translation, a range of different
translation approaches was used by Berkeley: thesaurus lookup, "entry vocabulary 
module (EVM)" and machine translation. They used a combination of all three 
approaches as well, giving them superior performance to any of the single 
approaches. 



CLEF 2000 Domain-specific Task - GIRT




Fig. 6. The best entries of the groups participating in the GIRT domain-specific subtask. 



5 Completeness of Relevance Assessments 

The results reported in this paper rely heavily on the concept of judging the relevance 
of documents with respect to given topics. The relevance of documents is usually 
judged by one or more human "assessors", making this a costly undertaking. These 
"relevance assessments" are then used for the calculation of the recall/precision 
figures that underlie the graphs and figures presented here and in the appendix of this 
volume. 

It is therefore not surprising that the quality of the relevance assessments is of 
concern to the participants. Indeed, with evaluation forums such as TREC becoming 
more and more popular, this issue has been frequently raised in the last few years. 
Two main concerns can be discerned: 



Concern 1: The "quality" of the relevance judgments. Of concern is the ability of the 
persons doing the assessment to sufficiently understand the topics and documents, and 
the consistency of the judgments (no personal bias, clear interpretation of the judging 
guidelines, etc.). On the one hand, it has been shown that agreement between 
assessors, when documents are judged by more than one person, is usually rather low. 
On the other hand, it has also been repeatedly demonstrated that while this 
disagreement can change absolute performance figures, the overall ranking of the
systems remains stable. 



Concern 2: The "completeness" of the relevance judgments. Of concern is the use of 
so-called "pooling methods". The use of human judges for relevance makes it 
impractical to judge every document in today's large scale test collections. Therefore, 
only a sample of documents, namely those retrieved with high scores by the evaluated 
systems, is judged. All unjudged documents are assumed to be not relevant. The 
assertion is that a sufficient number of diverse systems will turn up most relevant 
documents this way. A potential problem is the usability of the resulting test 
collection for the evaluation of a system that did not contribute to this "pool of judged 
documents". If such a system retrieves a substantial number of unjudged documents 
that are relevant, but were not found before, it is unfairly penalized when calculating 
the evaluation measures based on the official relevance assessments. It has been 
shown that the relevance assessments for the TREC collection are complete enough to 
make such problems unlikely. An investigation into whether the same is true for 
CLEF follows below. 

In order to study the quality of the relevance assessments, multiple sets of 
independent judgments would be needed. These are not available, which means that 
the subsequent discussion will be limited to the question of the completeness of the 
assessments. Since CLEF closely follows the practices of TREC in the design of the 
topics and the guidelines for assessment, and since NIST, the organizer of TREC, 
actively participates in the coordination of CLEF, the quality of the assessments in 
general can be assumed to be comparable (see [12] for an analysis of the TREC 
collections). 

One way to analyze the completeness of the relevance judgments is by focusing on
the "unique relevant documents" [13]. For this purpose, a unique relevant document
is defined as a document which was judged relevant with respect to a specific topic,
but that would not have been part of the pool of judged documents had a certain group
not participated in the evaluation. That is, only one group retrieved the document with a
score high enough to have it included in the judgment pool. This addresses the
concern that systems not directly participating in the evaluation are unfairly 
penalized. By subtracting relevant documents only found by a certain group, and then 
reevaluating the results for this group, we simulate the scenario that this group was a 
non-participant. The smaller the change in performance that is observed, the higher is 
the probability that the relevance assessments are sufficiently complete. 

For CLEF, this kind of analysis was run for the experiments that were submitted to 
the multilingual subtask. A total of twelve sets of relevance assessments were used: 
the original set, and eleven sets that were built by taking away the relevant documents 
uniquely found by one specific participant. The results for every multilingual 
experiment were then recomputed using the set without the group-specific relevant 
documents. Figure 7 shows the number of unique relevant documents per group 
participating in CLEF. The key figures obtained after rerunning the evaluations can be 
found in Table 6 and Figure 8. 
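
The leave-one-group-out procedure described above can be sketched as follows for a single topic. This is a minimal illustration under stated assumptions: the group names, document identifiers and pool contents are hypothetical toy data, and non-interpolated average precision is used, which is not necessarily the exact variant computed for the official figures.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list for one topic."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def unique_relevant(pools, relevant, group):
    """Relevant documents that only `group` contributed to the judgment pool."""
    others = set()
    for name, docs in pools.items():
        if name != group:
            others |= set(docs)
    return (set(pools[group]) & relevant) - others


# Hypothetical toy data: pooled documents per group and the judged-relevant set.
pools = {"A": ["d1", "d2", "d3"], "B": ["d2", "d4"], "C": ["d1", "d5"]}
relevant = {"d1", "d3", "d4"}
ranking_a = ["d3", "d1", "d9", "d4"]  # the run submitted by group A

uniq_a = unique_relevant(pools, relevant, "A")
ap_full = average_precision(ranking_a, relevant)              # official assessments
ap_reduced = average_precision(ranking_a, relevant - uniq_a)  # group A treated as non-participant
```

The difference between `ap_full` and `ap_reduced`, averaged over topics, corresponds to the per-run change in mean average precision that the analysis reports.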
Fig. 7. Number of unique relevant documents contributed by each CLEF participant for the
multilingual subtask. 



Table 6. Key figures for investigation into the effect of "unique relevant documents" on the 
recall and precision measures. Presented are the observed changes in mean average precision. 

                      Absolute    In percent
Mean difference       0.0013      0.80%
Max. difference       0.0059      5.99%
Standard deviation    0.0012      1.15%




Fig. 8. Changes in mean average precision (absolute values) for all multilingual runs submitted
to CLEF. The majority of runs experience a change of less than 0.002.

These numbers were calculated based on the absolute values of the differences. Note 
that even though relevant documents are removed from the evaluation, mean average 
precision can actually increase after recalculation due to interpolation effects. The 
figures reported for TREC in [11] are based on signed numbers, and therefore not 
directly comparable. For CLEF, calculating these numbers the TREC way, the mean 
difference is -0.0007, equivalent to a change of -0.57 percent. This compares 
favorably with an observed mean difference of -0.0019 (-0.78%) for TREC8 ad hoc 
and -0.0018 (-1.43%) for TREC9 Chinese CLIR. The ranking of the systems is also
very stable: the only two systems that switch ranks have an original performance 
difference of less than 0.1%, a difference that is well below any meaningful statistical 
significance. The relevance assessments for the CLEF test collection therefore seem 
to be well suited for evaluating systems that did not directly participate in the original 
evaluation campaign. 



6 Conclusions 

CLEF 2000 was a big success in attracting more participation. The participating 
groups submitted a diverse collection of experiments for all languages and subtasks. 
Some foci seem to have changed slightly from last year at the TREC8 cross-language 
track; specifically, increased European participation appears to have strengthened the 
emphasis on language-specific issues, such as stemming and decompounding. Those 
groups that concentrated on these issues had considerable success in the monolingual 
subtasks. The best performing cross-language experiments (the multilingual and 
bilingual subtasks) came from "veteran" TREC participants. It appears that these 
groups benefited from their experience, and it will be interesting to see if some of the 
newcomers can catch up in 2001.

An investigation into the completeness of the relevance assessments for the CLEF 
collection, an important precondition for the usefulness of the test collection in future 
evaluations, produced encouraging numbers. This makes the collection an attractive 
choice for a wide range of evaluation purposes outside the official CLEF campaign. 
Abstract. This paper describes the official runs of the Twenty-One
group for the first CLEF workshop. The Twenty-One group participated
in the monolingual, bilingual and multilingual tasks. The following new
techniques are introduced in this paper. In the bilingual task we experimented
with different methods to estimate translation probabilities.
In the multilingual task we experimented with refinements on raw-score
merging techniques and with a new relevance feedback algorithm that
re-estimates both the model’s translation probabilities and the relevance
weights. Finally, we performed preliminary experiments to exploit the
web to generate translation probabilities and bilingual dictionaries, notably
for English-Italian and English-Dutch.



1 Introduction 

Twenty-One is a project funded by the EU Telematics Applications programme,
sector Information Engineering. The project subtitle is “Development of a Multimedia
Information Transaction and Dissemination Tool”. Twenty-One started
in early 1996 and was completed in June 1999. Because the TREC ad-hoc and
cross-language information retrieval (CLIR) tasks fitted our needs to evaluate the
system on the aspects of monolingual and cross-language retrieval performance,
TNO-TPD and University of Twente participated under the flag of “Twenty-One”
in TREC-6/7/8. Since the cooperation is continued in other projects,
Olive and Druid, we have decided to continue our participation in CLEF as
“Twenty-One”.¹ For all tasks, we used the TNO vector retrieval engine. The
engine supports several term weighting schemes. The principal term weighting
scheme we used is based on the use of statistical language models for information
retrieval as explained below.



¹ Information on Twenty-One, Olive and Druid is available at
http://dis.tpd.tno.nl/

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 102- 11771 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
2 The Approach

All runs were carried out with an information retrieval system based on a simple 
unigram language model. The basic idea is that documents can be represented 
by simple statistical language models. Now, if a query is more probable given
a language model based on document $D^{(1)}$ than given e.g. a language model
based on document $D^{(2)}$, then we hypothesise that document $D^{(1)}$ is more
relevant to the query than document $D^{(2)}$. Thus the probability of generating a
certain query given a document-based language model can serve as a score to 
rank documents with respect to relevance. 



$$P(T_1, T_2, \ldots, T_n \mid D)\,P(D) = P(D) \prod_{i=1}^{n} \bigl((1-\lambda_i)\,P(T_i) + \lambda_i\,P(T_i \mid D)\bigr) \qquad (1)$$

Formula 1 shows the basic idea of this approach to information retrieval, where
the document-based language model is interpolated with a background language
model to compensate for sparseness. In the formula, $T_i$ is a random variable for
the query term on position $i$ in the query ($1 \le i \le n$, where $n$ is the query
length), whose sample space is the set $\{t^{(1)}, t^{(2)}, \ldots, t^{(m)}\}$ of all terms in the
collection. The probability measure $P(T_i)$ defines the probability of drawing a
term at random from the collection, $P(T_i \mid D)$ defines the probability of drawing
a term at random from the document, and $\lambda_i$ defines the importance of each
query term. The marginal probability of relevance of a document $P(D)$ might
be assumed uniformly distributed over the documents, in which case it may be
ignored in the above formula.
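
As an illustration of formula 1, the following minimal sketch ranks a toy collection by the logarithm of the interpolated query probability, with a uniform prior $P(D)$ that is dropped. The function name and the toy documents are hypothetical, and $P(T_i)$ is estimated here from collection term frequencies, which is an assumption made only for this sketch.

```python
import math
from collections import Counter


def lm_score(query, doc, collection, lam=0.3):
    """Log of formula 1 for one document, assuming a uniform prior P(D)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(term for d in collection for term in d)
    doc_len = len(doc)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len   # background model P(T_i)
        p_doc = doc_tf[term] / doc_len      # document model P(T_i | D)
        p = (1 - lam) * p_coll + lam * p_doc
        if p == 0.0:
            return float("-inf")            # term unseen in the whole collection
        score += math.log(p)
    return score


# Hypothetical toy collection of tokenised documents.
docs = [
    ["hazardous", "waste", "disposal"],
    ["waste", "paper", "recycling"],
    ["city", "council", "budget"],
]
query = ["hazardous", "waste"]
ranked = sorted(range(len(docs)),
                key=lambda i: lm_score(query, docs[i], docs),
                reverse=True)
```

Because of the background model, the third document still receives a finite score even though it contains neither query term; it is simply ranked last.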

2.1 A Model of Cross-Language Information Retrieval 

Information retrieval models and statistical translation models can be integrated
into one unifying model for cross-language information retrieval [2,5]. Let $S_i$ be a
random variable for the source language query term on position $i$. Each document
gets a score defined by the following formula.

$$P(S_1, S_2, \ldots, S_n \mid D)\,P(D) = P(D) \prod_{i=1}^{n} \sum_{j=1}^{m} P(S_i \mid T_i = t^{(j)}) \bigl((1-\lambda_i)\,P(T_i = t^{(j)}) + \lambda_i\,P(T_i = t^{(j)} \mid D)\bigr) \qquad (2)$$

In the formula, the probability measure $P(S_i \mid T_i = t^{(j)})$ defines the translation
probabilities.



2.2 Translation in Practice 

In practice, the statistical translation model will be used as follows. The automatic
query formulation process will translate the query $S_1, S_2, \ldots, S_n$ using
a probabilistic dictionary. The probabilistic dictionary is a dictionary that lists
pairs $(s, t)$ together with their probability of occurrence, where $s$ is from the
sample space of $S_i$ and $t$ is from the sample space of $T_i$. For each $S_i$ there will be
one or more realisations $t_i$ of $T_i$ for which $P(S_i \mid T_i = t_i) > 0$, which will be called
the possible translations of $S_i$. The possible translations should be grouped for
each $i$ to search the document collection, resulting in a structured query.

For instance, suppose the original French query on an English collection is
“déchets dangereux”, then possible translations of “déchets” might be “waste”,
“litter” or “garbage”, possible translations of “dangereux” might be “dangerous”
or “hazardous” and the structured query can be presented as follows.

((waste ∪ litter ∪ garbage), (dangerous ∪ hazardous))

The product from $i = 1$ to $n$ (in this case $n = 2$) of equation 2 is represented
above by using the comma as is done in the representation of a query of length 2
as $T_1, T_2$. The sum from $j = 1$ to $m$ of equation 2 is represented by displaying only
the realisations of $T_i$ for which $P(S_i \mid T_i) > 0$ and by separating those by ‘∪’. So,
in practice, translation takes place during automatic query formulation (query
translation), resulting in a structured query like the one displayed above that is
matched against each document in the collection. Unless stated otherwise, whenever
this paper mentions ‘query terms’, it will denote the target language query
terms: realisations of $T_i$. Realisations of $S_i$, the source language query terms, will
usually be left implicit. The combination of the structured query representation
and the translation probabilities will implicitly define the sequence of the source
language query terms $S_1, S_2, \ldots, S_n$, but the actual realisation of the sequence
is not important to the system.
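
The query formulation step and the scoring of a structured query can be sketched as follows. This is only an illustration: the dictionary entries, their probabilities and the toy document and collection models are hypothetical values, and the scoring function is a direct reading of the cross-language formula with a uniform prior $P(D)$.

```python
import math

# Hypothetical probabilistic dictionary: source term -> {target term: P(s | t)}.
DICTIONARY = {
    "déchets": {"waste": 0.6, "litter": 0.2, "garbage": 0.2},
    "dangereux": {"dangerous": 0.5, "hazardous": 0.5},
}


def structured_query(source_terms, dictionary):
    """Group the possible translations of each source term, one group per position."""
    return [dictionary.get(s, {}) for s in source_terms]


def clir_score(groups, doc_model, coll_model, lam=0.3):
    """Log score: a product over groups of a translation-weighted sum."""
    score = 0.0
    for group in groups:
        p = sum(
            tau * ((1 - lam) * coll_model.get(t, 0.0) + lam * doc_model.get(t, 0.0))
            for t, tau in group.items()
        )
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


groups = structured_query(["déchets", "dangereux"], DICTIONARY)
as_text = ", ".join("(" + " ∪ ".join(g) + ")" for g in groups)

# Hypothetical term probabilities for the collection and two documents.
coll_model = {"waste": 0.02, "litter": 0.01, "garbage": 0.01,
              "dangerous": 0.02, "hazardous": 0.01}
doc_a = {"waste": 0.1, "hazardous": 0.1}
doc_b = {"litter": 0.1}
score_a = clir_score(groups, doc_a, coll_model)
score_b = clir_score(groups, doc_b, coll_model)
```

Here `as_text` reproduces the grouped representation of the example query, and the document containing a likely translation for both source terms receives the higher score.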



2.3 Probability Estimation 

The prior probability of relevance of a document $P(D)$, the probability of term
occurrence in the collection $P(T_i)$ and the probability of term occurrence in the
relevant document $P(T_i \mid D)$ are defined by the collection that is searched. For
the evaluations reported in this paper, the following definitions were used, where
$tf(t, d)$ denotes the number of occurrences of the term $t$ in the document $d$, and
$df(t)$ denotes the number of documents in which the term $t$ occurs. Equation
3 is the definition used for the unofficial “document length normalisation” runs
reported later in this paper.

$$P(D = d) = \frac{\sum_t tf(t, d)}{\sum_t \sum_k tf(t, k)} \qquad (3)$$

$$P(T_i = t_i \mid D = d) = \frac{tf(t_i, d)}{\sum_t tf(t, d)} \qquad (4)$$

$$P(T_i = t_i) = \frac{df(t_i)}{\sum_t df(t)} \qquad (5)$$
The translation probabilities $P(S_i \mid T_i)$ and the value of $\lambda_i$, however, are unknown.
The collection that is searched was not translated, or if it was translated, the
translations are not available. Translation probabilities should therefore be estimated
from other data, for instance from a parallel corpus. The value of $\lambda_i$
determines the importance of the source language query term. If $\lambda_i = 1$ then
the system will assign zero probability to documents that do not contain any of
the possible translations of the original query term on position $i$. In this case, a
possible translation of the source language term is mandatory in the retrieved
documents. If $\lambda_i = 0$ then the possible translations of the original query term
on position $i$ will not affect the final ranking. In this case, the source language
query term is treated as if it were a stop word. For ad-hoc queries, it is not known
which of the original query terms are important and which are not important
and a constant value for each $\lambda_i$ is taken. The system’s default value is $\lambda_i = 0.3$.
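
The two boundary cases and the default can be checked numerically with one factor of the interpolated model. The probabilities below are hypothetical values chosen only to illustrate the behaviour.

```python
def term_probability(lam, p_coll, p_doc):
    """One factor of the model: (1 - lambda) * P(T) + lambda * P(T | D)."""
    return (1 - lam) * p_coll + lam * p_doc


p_coll = 0.01                        # hypothetical collection probability of the term
matching, non_matching = 0.05, 0.0   # hypothetical P(T | D) in two documents

# lambda = 1: a document without the term gets zero probability (term is mandatory).
mandatory = term_probability(1.0, p_coll, non_matching)

# lambda = 0: both documents get the same factor (term behaves like a stop word).
stop_like = (term_probability(0.0, p_coll, matching),
             term_probability(0.0, p_coll, non_matching))

# lambda = 0.3: the matching document is preferred, the other is not excluded.
default = (term_probability(0.3, p_coll, matching),
           term_probability(0.3, p_coll, non_matching))
```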

2.4 Implementation 

Equation 2 is not implemented as is, but instead it is rewritten into a weighting
algorithm that assigns zero weight to terms that do not occur in the document.
Filling in the definitions of equations 4 and 5 in equation 2 results in the
following formula. The probability measure $P(S_i \mid T_i = t^{(j)})$ will be replaced by
the translation probability estimates $\tau_i(j)$.

$$P(D, S_1, S_2, \ldots, S_n) = P(D) \prod_{i=1}^{n} \sum_{j=1}^{m} \tau_i(j) \left( (1-\lambda_i) \frac{df(t^{(j)})}{\sum_t df(t)} + \lambda_i \frac{tf(t^{(j)}, d)}{\sum_t tf(t, d)} \right)$$

The translation probabilities can be moved into the inner sum. As summing is
associative and commutative, it is not necessary to calculate each probability
separately before adding them. Instead, respectively the document frequencies
and the term frequencies of the disjuncts can be added beforehand, properly
multiplied by the translation probabilities. Only $\lambda_i$ in the big sum is constant
for every addition and can therefore be moved outside the sum, resulting in:

$$P(D, S_1, S_2, \ldots, S_n) = P(D) \prod_{i=1}^{n} \left( (1-\lambda_i) \frac{\sum_{j=1}^{m} \tau_i(j)\, df(t^{(j)})}{\sum_t df(t)} + \lambda_i \frac{\sum_{j=1}^{m} \tau_i(j)\, tf(t^{(j)}, d)}{\sum_t tf(t, d)} \right)$$

Using simple calculus, the probability measure can now be rewritten
into a term weighting algorithm that assigns zero weight to non-matching terms,
resulting in equation 6. The formula ranks documents in exactly the same order
as equation 2.

$$P(D, S_1, S_2, \ldots, S_n) \propto \log\Bigl(\sum_t tf(t, d)\Bigr) + \sum_{i=1}^{n} \log\left(1 + \frac{\lambda_i \bigl(\sum_{j=1}^{m} \tau_i(j)\, tf(t^{(j)}, d)\bigr) \sum_t df(t)}{(1-\lambda_i) \bigl(\sum_{j=1}^{m} \tau_i(j)\, df(t^{(j)})\bigr) \sum_t tf(t, d)}\right) \qquad (6)$$
Equation 6 is the algorithm implemented in the TNO retrieval engine. It contains
a weighted sum of respectively the term frequencies and the document
frequencies where the weights are determined by the translation probabilities
$\tau_i(j)$. Unweighted summing of frequencies was used before for on-line stemming
in a vector space model retrieval system. Unweighted summing of frequencies
is implemented in the Inquery system as the “synonym operator”. Grouping
possible translations of a source language term by the Inquery synonym operator
has been shown to be a successful approach to cross-language information retrieval.

The model does not require the translation probabilities $\tau_i(j)$ to sum up to
one for each $i$, since they are conditioned on the target language query term
and not on the source language query term. Interestingly, for the final ranking
it does not matter what the actual sum of the translation probabilities is. Only
the relative proportions of the translations define the final ranking of documents.
This can be seen from $\tau_i(j)$, which occurs in the numerator and in the denominator
of the big fraction in equation 6.
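
The weighting scheme and its insensitivity to the scale of the translation probabilities can be sketched as follows. This is a minimal reading of the weighted-sum algorithm, assuming a document prior proportional to document length; the function name, the frequencies and the translation weights are hypothetical, and this is not the TNO engine's actual code.

```python
import math


def weight(groups, tf, df, doc_len, df_total, lam=0.3):
    """Log document length plus one log term per query position.

    groups: list of {target term: tau}; tf: term -> frequency in this document;
    df: term -> document frequency; doc_len: sum of tf over the document;
    df_total: sum of df over all terms in the collection.
    """
    score = math.log(doc_len)
    for group in groups:
        w_tf = sum(tau * tf.get(t, 0) for t, tau in group.items())
        w_df = sum(tau * df.get(t, 0) for t, tau in group.items())
        if w_tf > 0 and w_df > 0:  # non-matching groups contribute zero weight
            score += math.log(1 + (lam * w_tf * df_total) / ((1 - lam) * w_df * doc_len))
    return score


# Hypothetical structured query, statistics and document.
groups = [{"waste": 0.6, "litter": 0.2, "garbage": 0.2},
          {"dangerous": 0.5, "hazardous": 0.5}]
scaled = [{t: 10 * tau for t, tau in g.items()} for g in groups]
df = {"waste": 40, "litter": 10, "garbage": 15, "dangerous": 30, "hazardous": 20}
doc_tf = {"waste": 3, "hazardous": 1}

original = weight(groups, doc_tf, df, doc_len=120, df_total=5000)
rescaled = weight(scaled, doc_tf, df, doc_len=120, df_total=5000)
```

Multiplying every translation weight by the same constant leaves the score unchanged, because the constant cancels between the weighted term frequencies and the weighted document frequencies.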

2.5 A Relevance Feedback Method for Cross-Language Retrieval 

This paper introduces a new relevance feedback method for cross-language information
retrieval. If there were some known relevant documents, then the values
of $\tau_i(j)$ and $\lambda_i$ could be re-estimated from that data. The idea is the following.
Suppose there are three known relevant English documents to the French query
“déchets dangereux”. If two out of three documents contain the term “waste”
and none contain the terms “litter” and “garbage” then this is an indication that
“waste” is the correct translation and should be assigned a higher translation
probability than “litter” and “garbage”. If only one of the three known relevant
documents contains one or more possible translations of “dangereux” then this
is an indication that the original query term “déchets” is more important (possible
translations occur in more relevant documents) than the original query
term “dangereux” and the value of $\lambda_i$ should be higher for “déchets” than for
“dangereux”.

The actual re-estimation of $\tau_i(j)$ and $\lambda_i$ was done by iteratively applying the
EM-algorithm defined by the formulas in equation 7. In the algorithm, $\tau_i^{(p)}(j)$
and $\lambda_i^{(p)}$ denote the values on the $p$th iteration and $r$ denotes the number of
known relevant documents. The values are initialised with the translation probabilities
from the dictionary and with $\lambda_i^{(0)} = 0.3$. The re-estimation formulas
should be used simultaneously for each $p$ until the values do not change significantly
anymore.







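$$\tau_i^{(p+1)}(j) = \frac{1}{r} \sum_{k=1}^{r} \frac{\tau_i^{(p)}(j) \bigl((1-\lambda_i^{(p)})\,P(T_i = t^{(j)}) + \lambda_i^{(p)}\,P(T_i = t^{(j)} \mid D_k)\bigr)}{\sum_{l=1}^{m} \tau_i^{(p)}(l) \bigl((1-\lambda_i^{(p)})\,P(T_i = t^{(l)}) + \lambda_i^{(p)}\,P(T_i = t^{(l)} \mid D_k)\bigr)}$$

$$\lambda_i^{(p+1)} = \frac{1}{r} \sum_{k=1}^{r} \frac{\lambda_i^{(p)} \sum_{l=1}^{m} \tau_i^{(p)}(l)\,P(T_i = t^{(l)} \mid D_k)}{\sum_{l=1}^{m} \tau_i^{(p)}(l) \bigl((1-\lambda_i^{(p)})\,P(T_i = t^{(l)}) + \lambda_i^{(p)}\,P(T_i = t^{(l)} \mid D_k)\bigr)} \qquad (7)$$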



The re-estimation of τi(j) and λi was done from ‘pseudo-relevant’ documents.
First the top 10 documents were retrieved using the default values of τi(j) and
λi, and then the feedback algorithm was used on these documents to find the
new values. The algorithm actually implemented was a variation of equation (7)
of the form (1/(r+1)) · (default value + Σ_{k=1}^{r} ...), which prevents, for
example, λi = 1 after re-estimation.
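
The re-estimation described in this section can be illustrated with a short sketch. It is not the implementation used for the runs: it assumes a standard EM update for a two-component mixture (collection model and document model) per candidate translation, and the names `em_step`, `p_coll` and `p_docs` are hypothetical. Passing `default` switches to the smoothed (1/(r+1)) variant mentioned above.

```python
def em_step(tau, lam, p_coll, p_docs, default=None):
    """One re-estimation step for a single source query term.

    tau[j]       current translation probability of candidate translation j
    lam          current relevance weight of the query term
    p_coll[j]    P(t_j | collection), assumed > 0 for every candidate
    p_docs[k][j] P(t_j | D_k) for each of the r (pseudo-)relevant documents
    default      optional (tau0, lam0); if given, the update is
                 (default + sum over documents) / (r + 1)
    """
    r, m = len(p_docs), len(tau)
    tau_sum, lam_sum = [0.0] * m, 0.0
    for p_doc in p_docs:
        # Weight of each candidate translation under the mixture model.
        mix = [tau[j] * ((1 - lam) * p_coll[j] + lam * p_doc[j]) for j in range(m)]
        norm = sum(mix)
        for j in range(m):
            tau_sum[j] += mix[j] / norm
        # Share of the probability mass explained by the document itself.
        lam_sum += sum(tau[j] * lam * p_doc[j] for j in range(m)) / norm
    if default is None:
        return [s / r for s in tau_sum], lam_sum / r
    tau0, lam0 = default
    return ([(tau0[j] + tau_sum[j]) / (r + 1) for j in range(m)],
            (lam0 + lam_sum) / (r + 1))


# "waste", "litter", "garbage": only "waste" occurs in two of three documents.
tau0, lam0 = [1 / 3, 1 / 3, 1 / 3], 0.3
p_coll = [0.001, 0.001, 0.001]
p_docs = [[0.02, 0.0, 0.0], [0.01, 0.0, 0.0], [0.0, 0.0, 0.0]]
new_tau, new_lam = em_step(tau0, lam0, p_coll, p_docs)
smooth_tau, smooth_lam = em_step(tau0, lam0, p_coll, p_docs, default=(tau0, lam0))
```

In this toy example the translation probability of the first candidate grows at the expense of the other two, and the smoothed variant keeps the relevance weight strictly below 1.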

3 Translation Resources

As in previous years we applied a dictionary-based query translation approach.
The translations were based on the VLIS lexical database of Van Dale publishers.
Because VLIS currently lacks translations into Italian, we used two other
resources: i) the Systran web-based MT engine and ii) a probabilistic lexicon
based on a parallel web corpus. The next section describes the construction of
this new resource in more detail.

3.1 Parallel Web Corpora 

We developed three parallel corpora based on web pages in close cooperation with 
RALI, Universite de Montreal. RALI already had developed an English-French 
parallel corpus of web pages, so it seemed interesting to investigate the feasibility 
of a full multilingual system based on web derived lexical resources only. We 
used the PTMiner tool 0 to find web pages which have a high probability to be 
translations of each other. The mining process consists of the following steps: 

1. Query a web search engine for web pages with a hyperlink anchor text
“English version” and respective variants.

2. (For each web site) Query a web search engine for all web pages on a par- 
ticular site. 

3. (For each web site) Try to find pairs of path names that match certain
patterns, e.g.: /department/research/members/english/home.html and
/department/research/members/italian.html.

4. (For each pair) download web pages, perform a language check using a prob- 
abilistic language classifier, remove pages which are not positively identified 
as being written in a particular language. 
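
Step 3 of the mining process can be sketched as follows. This is only an illustration, not the PTMiner implementation: the marker lists and the function names `language_key` and `candidate_pairs` are assumptions, and the sketch handles only the simple case in which two paths differ in a single language marker.

```python
import re

# Hypothetical language markers; a real system would use a richer list.
ENGLISH = ["english", "en"]
ITALIAN = ["italian", "italiano", "it"]


def language_key(path, markers):
    """Replace the first language marker in the path by '*'.

    Returns None when the path contains no marker. Markers must not be
    embedded in a longer word ('en' does not match inside 'department').
    """
    for marker in markers:
        pattern = r"(?<![a-z])" + re.escape(marker) + r"(?![a-z])"
        key, count = re.subn(pattern, "*", path.lower(), count=1)
        if count:
            return key
    return None


def candidate_pairs(paths, markers_a, markers_b):
    """Pair paths that become identical once the language marker is masked."""
    by_key = {}
    for path in paths:
        key = language_key(path, markers_a)
        if key is not None:
            by_key.setdefault(key, []).append(path)
    pairs = []
    for path in paths:
        key = language_key(path, markers_b)
        if key in by_key:
            pairs.extend((other, path) for other in by_key[key])
    return pairs


paths = [
    "/dept/members/english/home.html",
    "/dept/members/italian/home.html",
    "/dept/contact.html",
]
pairs = candidate_pairs(paths, ENGLISH, ITALIAN)
```

Candidate pairs found this way would still have to pass the download and language identification of step 4.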

The mining process was run for three language pairs and resulted in three
modest size parallel corpora. Table 1 lists the sizes of the corpus during the
intermediate steps. Due to the dynamic nature of the web, many pages that have
been indexed no longer exist. Sometimes a site is down for maintenance. Finally,
many pages are simply placeholders for images and are discarded by the
language identification step.

These parallel corpora have been used in different ways: i) to refine the
estimates of translation probabilities of a dictionary based translation system
(corpus based probability estimation) and ii) to construct simple statistical
translation models. The former application will be described in more detail in
Section 5.2, the latter in Section 5.3. The translation models for
English-Italian and English-German, complemented with an already existing model
for English-French, also formed the basis for a full corpus based multilingual
translation run, which is described elsewhere in this volume [3].
Table 1. Intermediate sizes during corpus construction

language  number of   number of        number of        retrieved and
          web sites   candidate pages  candidate pairs  cleaned pairs
EN-IT     3,651       1,053,649        23,447           4,768
EN-DE     3,817       1,828,906        33,577           5,743
EN-NL     3,004       1,170,082        24,738           2,907


4 Merging Intermediate Runs 

Our strategy for multilingual retrieval is to translate the query into the
document languages, perform separate language specific runs and merge the
results into a single result file. In previous CLIR evaluations, we compared
different merging strategies:

round robin Here the idea is that document scores are not comparable across
collections. Because we are basically ignorant about the distribution of
relevant documents in the retrieved lists, round robin assumes that these
distributions are similar across languages.

raw score This type of merging assumes that document scores are comparable 
across collections. 

rank based It has been observed that the relationship between probability of 
relevance and the log of the rank of a document can be approximated by 
a linear function, at least for a certain class of IR systems. If a training 
collection is available, one can estimate the parameters of this relationship 
by applying regression. Merging can subsequently be based on the estimated 
probability of relevance. Note that the actual score of a document is only 
used to rank documents, but that merging is based on the rank, not on the 
score. 
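
The three strategies can be contrasted in a few lines of code. This is a minimal sketch, not the merging code used for the runs: each run is assumed to be a list of (document id, score) pairs in rank order, and the regression parameters (a, b) per collection are assumed to have been estimated elsewhere on training data.

```python
import math
from itertools import zip_longest


def round_robin(runs):
    """Interleave the runs rank by rank; scores are ignored."""
    merged = []
    for tier in zip_longest(*runs):
        merged.extend(item[0] for item in tier if item is not None)
    return merged


def raw_score(runs):
    """Pool all documents and sort on the original scores."""
    pooled = [item for run in runs for item in run]
    return [doc for doc, _ in sorted(pooled, key=lambda x: x[1], reverse=True)]


def rank_based(runs, params):
    """Sort on an estimated probability of relevance a + b * log(rank)."""
    pooled = []
    for run, (a, b) in zip(runs, params):
        for rank, (doc, _score) in enumerate(run, start=1):
            pooled.append((a + b * math.log(rank), doc))
    return [doc for _, doc in sorted(pooled, key=lambda x: x[0], reverse=True)]


english = [("e1", 9.0), ("e2", 4.0)]
german = [("d1", 5.0), ("d2", 4.5), ("d3", 1.0)]
rr = round_robin([english, german])
rs = raw_score([english, german])
rb = rank_based([english, german], [(0.5, -0.1), (0.8, -0.1)])
```

With these toy values the three strategies produce three different orderings, which shows how strongly the merged list depends on the assumption made about score comparability.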

The new CLEF multilingual task is based on a new document collection which 
makes it hard to compute reliable estimates for the linear parameters; a training 
set is not available. A second disadvantage of the rank based merging strat- 
egy is that the linear function generalises across topics. Unfortunately in the 
multilingual task, the distribution of relevant documents over the subcollections 
is quite skewed. All collections have several (differing) topics without relevant 
documents, so applying a rank based merging strategy would hurt the perfor- 
mance for these topics, because the proportion of retrieved documents in every 
collection is the same for every topic. 

The raw score merging strategy (which proved successful last year) does not
need training data and does not suffer from the equal proportions strategy
either. Unfortunately, scores are usually not fully compatible across
collections. We have tried to identify the factors which cause these
differences, and have applied two normalisation techniques. First, we treat
term translations as a weighted concept vector (cf. Section 2). That means that
we can normalise scores across topics by dividing the score by the query
length. This amounts to computing the geometric average of probabilities per
query concept. Secondly, we have observed that collection size has a large
influence on the occurrence probability estimates P(Ti|C), because the
probability of rare terms is inversely proportional to the collection size.
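
The first normalisation can be checked with a small worked example. The sketch assumes that the document score is a sum of log probabilities, one per query concept; the function names are illustrative only.

```python
import math


def normalised_score(concept_probs):
    """Log score of a document divided by the query length."""
    log_score = sum(math.log(p) for p in concept_probs)
    return log_score / len(concept_probs)


def geometric_mean(concept_probs):
    """Geometric average of the per-concept probabilities."""
    return math.prod(concept_probs) ** (1.0 / len(concept_probs))


probs = [0.2, 0.05, 0.1]
difference = normalised_score(probs) - math.log(geometric_mean(probs))
```

Under this assumption the length-normalised score equals the logarithm of the geometric mean, so queries of different length become comparable.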




[Figure 1: probability estimate plotted against collection size, one curve per term: offici, It, societi, educ, nation, lunar, war, grade, former, mark, friendship, charg, Vinh, say, ex, overthrown, agenc, time, peopl, school, nghi]

Fig. 1. Probability estimates vs collection size



Figure 1 shows the probability estimates of a sample of words from one document
as more documents are added to the collection. The occurrence probability of
common words stabilises quickly as the collection size increases. The rarer a
word is, however, the higher the degree of overestimation of its occurrence
probability. This effect is a consequence of the sparse data problem. In fact, a
small collection will never yield correct term occurrence probability estimates.

The collection-size dependency of collection-frequency (or global term
frequency) estimates has a direct influence on the distribution of document
scores for a particular query. When the collection is small, the scores will be
lower than the scores on a large collection. This is due to the fact that the
score we study is based on the maximum likelihood ratio. So the median of the
distribution of document scores for a particular topic (set) increases with the
collection size. Thus when we use the raw scores of different subcollections as
a basis for merging, large collections will be favoured.

We hypothesised that we could improve the merging process if we could
correct the estimates for their dependence on the collection size. Suppose we
have just two collections of different size (and different language), C1 and
C2, with vocabulary sizes V1 and V2 and numbers of tokens T1 and T2
respectively, with T1 ≪ T2. Now we could either try to extrapolate the term
occurrence probability estimates on collection C1 to a hypothetical collection
with T2 tokens, or try to ‘downscale’ the term occurrence probability estimates
of a term from C2 to vocabulary size V1.

The first option seems cumbersome, because we have hardly any information to
guide the extrapolation process. The second option, adapting the estimates
of the large collection to the small collection, seems more viable. The idea is
to adapt the probability estimates of rare terms in such a way that they
become ‘compatible’ with the estimates on the small collection. As shown in
Figure 1, the estimates of frequent terms stabilise soon. Our idea is to
construct a mapping function which maps the probability estimates to the small
collection domain. The mapping function has the following requirements: a
probability 1/T2 has to be mapped to 1/T1, so this probability is multiplied by
the factor T2/T1, and probabilities p larger than 1/T2 are multiplied by a
factor which decreases for larger p. In fact we only want very small changes
for p > 10^(…). A function which meets these requirements is the polynomial
f(x) = x − ax^(…) (where x = log(p) and a = …). Because we have re-estimated
the probabilities, one would expect that the probabilities have to be
re-normalised (p'(ti) = p(ti) / Σj p(tj)). However, this has the result that
all global probabilities (also those of relatively frequent words) are
increased, which will increase the score of all documents, i.e. it has the
opposite effect of what we want. So we decided not to re-normalise, because a
smaller corpus would also have a smaller vocabulary, which would compensate for
the increase in probability mass that results from the transformation.
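
A mapping function with the stated properties can be sketched as follows. The exponent of the polynomial and the value of a are not legible in the text above, so the sketch treats the exponent as a free integer parameter (the default of 2 is purely an assumption) and derives a from the single requirement that 1/T2 is mapped to 1/T1. The name `make_mapping` is hypothetical.

```python
import math


def make_mapping(t_small, t_large, exponent=2):
    """Build f(p) = exp(x - a * x**exponent) with x = log(p).

    The constant a is chosen such that the probability 1/t_large is mapped
    to 1/t_small. The exponent must be an integer because x is negative.
    """
    x_min = math.log(1.0 / t_large)
    x_target = math.log(1.0 / t_small)
    a = (x_min - x_target) / x_min ** exponent

    def mapped(p):
        x = math.log(p)
        return math.exp(x - a * x ** exponent)

    return mapped


mapping = make_mapping(t_small=10**5, t_large=10**6)
# Multiplication factor for a very rare, a rare and a frequent term.
factors = [mapping(p) / p for p in (1e-6, 1e-3, 1e-1)]
```

The factor shrinks towards 1 as p grows, as required; a higher exponent concentrates the correction more strongly on the rarest terms.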



5 Results 

5.1 Monolingual Runs 

We indexed the collections in the 4 languages separately. All documents were
lemmatised using the Xelda morphological toolkit from Xerox XRCE and stopped
with language specific stoplists. For German, we split compounds and added
both the full compound and its parts to the index. This strategy is motivated
by our experience with a Dutch corpus (Dutch is also a compounding language)
and tests on the TREC CLIR test collection. Table 2 shows the results of the
monolingual runs; runs in bold are judged runs, runs in italic font are
unofficial runs (mostly post-hoc). The table also lists the proportion of
documents which has been judged. The standard runs include fuzzy lookup of
unknown words. The expand option adds close orthographical variants for every
query term. The official runs were done without the document length
normalisation defined earlier.



Table 2. Results of the monolingual runs

run name    avp     above median  description            %j@1000  %j@100  %j@10
tnoutdd1    0.3760                standard               18.64    79.05   100
tnoutdd2    0.3961  28/37         +expand                18.72    81.22   100
tnoutdd2l   0.3968  -             +length normalisation  18.58    78.22   97.50
tnoutff1    0.4551                standard               16.13    79.42   100
tnoutff2    0.4471  18/34         +expand                16.21    80.88   100
tnoutff2l   0.4529  -             +length normalisation  16.00    77.88   97.50
tnoutii1    0.4677                standard               16.59    78.92   100
tnoutii2    0.4709  18/34         +expand                16.67    80.33   100
tnoutii2l   0.4808  -             +length normalisation  16.66    77.25   98
tnouteeOli  0.4200                standard               17.81    71.10   100
tnouteeOl   0.4169  -             +expand                17.84    70.75   99.75
tnouteeOli  0.4273  -             +length normalisation  17.82    69.30   98.00



The first thing that strikes us is that the pool depth is 50, contrary to
the practice in TREC, where the top 100 documents are judged for relevance.
Section 5.4 analyses the CLEF collection further. Length normalisation
usually gives a modest improvement in average precision. The ‘expand’ option
was especially effective for German. The reason is probably that compound parts
are not always properly lemmatised by the German morphology. The German run
performs especially well, with 28 out of 37 topics above the median. This
relatively good performance is probably due to the morphology, which includes
compound splitting.

5.2 Bilingual Runs 

Table 3 lists the results of the bilingual runs. All runs use Dutch as the
query language. The base run of 0.3069 can be improved by several techniques: a
higher lambda, document length normalisation, or Porter stemming instead of
dictionary-based stemming. The latter can be explained by the fact that Porter’s
algorithm is an aggressive stemmer that also removes most of the derivational
affixes. This is usually beneficial to retrieval performance. The experiment
with corpus based frequencies yielded disappointing results. We first generated
topic translations in a standard fashion based on VLIS. Subsequently we replaced
the translation probabilities P(w_nl | w_en) by rough corpus based estimates. We
simply looked up all English sentences which contained the translation and
determined the proportion of the corresponding (aligned) Dutch sentences that
contained the original Dutch query word. If the pair was not found, the original
probability was left unchanged. Unfortunately, many of the query terms and
translations were not found in the aligned corpus, because they were lemmatised
whereas the corpus was not. This mismatch certainly hurt the estimates. The
procedure resulted in high translation probabilities for words that did not
occur in the corpus and low probabilities for words that did occur.
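
The corpus based estimate described above can be sketched as follows. The data layout (a list of aligned sentence pairs, each a pair of token sets) and the function name are assumptions made for illustration; the sketch reads "pair not found" as "no aligned sentence pair contains both words".

```python
def corpus_translation_prob(dutch_word, english_word, aligned_pairs, default):
    """Estimate P(dutch_word | english_word) from aligned sentences.

    aligned_pairs: iterable of (english_tokens, dutch_tokens) sets.
    default: dictionary-based probability, kept when the pair is not found.
    """
    dutch_sides = [nl for en, nl in aligned_pairs if english_word in en]
    hits = sum(1 for nl in dutch_sides if dutch_word in nl)
    if hits == 0:
        return default
    return hits / len(dutch_sides)


pairs = [
    ({"dangerous", "waste"}, {"gevaarlijk", "afval"}),
    ({"waste", "water"}, {"afvalwater"}),
    ({"time"}, {"tijd"}),
]
found = corpus_translation_prob("afval", "waste", pairs, default=0.9)
missing = corpus_translation_prob("afval", "garbage", pairs, default=0.9)
```

The toy corpus also reproduces the problem reported above: the pair that is found receives a lower probability (0.5) than the pair that is missing and keeps its default (0.9).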



Table 3. Results of the bilingual runs

run name        avp     above median  description
tnoutne1        0.3069  27/33         standard
tnoutne1l       0.3278  -             + doclen norm
tnoutne1p       0.3442  -             + λ = 0.7
tnoutne2        0.2762  25/33         corpus frequencies
tnoutne3-stem   0.3366  -             Porter stemmer + doclen norm
tnoutne4        0.2946  20/33         pseudo relevance feedback (PRF)
tnoutne4-fix    0.3266  -             PRF bugfix + doclen norm, Porter
tnoutne4-retro  0.4695  -             retrospective relevance feedback



The pseudo relevance feedback runs were done with the experimental language
models retrieval engine at the University of Twente, using an index based
on the Porter stemming algorithm. The run tagged tnoutne3-stem is the
baseline run for this system. The official pseudo relevance feedback run used
the top 10 documents retrieved to re-estimate relevance weights and translation
probabilities, but turned out to contain a bug. The unofficial fixed run
tnoutne4-fix performs a little worse than the baseline. The run tnoutne4-retro
uses the relevant documents to re-estimate the probabilities retrospectively.
This run reaches an impressive performance of 0.4695 average precision, much
higher even than the best monolingual English run. This indicates that the
algorithm might be helpful in an interactive setting where the user’s feedback
is used to retrieve a new, improved set of documents. Apparently, the top 10
retrieved documents contain too much noise to be useful for the re-estimation
of the model’s parameters.



5.3 Multilingual Runs 

Table 4 shows that our best multilingual run was a run with Dutch as the query
language. On the one hand this is surprising, because this run is composed of 4
bilingual runs instead of 3 for the EN→X run. But the translation is based
on the VLIS lexical database, which is built on lexical relations with Dutch as
the source language. Thus the translations in the NL→X case are much cleaner
than in the EN→X case. In the latter case, Dutch serves as a pivot language. On
the other hand, the NL→IT translation is quite cumbersome. We first used Xelda
to translate the Dutch queries to English stopped and lemmatised files. These
files were subsequently translated by Systran.



Table 4. Results of the X → EN, FR, DE, IT runs

run name   avp     above median  description
tnoutex1   0.2214  25/40         baseline run
tnoutex2   0.2165  26/40         merged
tnoutex2f  0.2219  -             fixed
tnoutex3   0.1960  25/40         Web based EN-IT lexicon
tnoutnx1   0.2256  23/40         query language is Dutch



Another interesting point is that the intermediate bilingual run based on the 
parallel web corpus performed quite well, with an average precision of 0.2750 
versus 0.3203 of Systran. The translation of this run is based on a translation 
model trained on the parallel web corpus. The English topics were simply stopped 
and translated by the translation model. We took the most probable translation
and used that as the Italian query. We plan to experiment with a more refined
approach where we import the translation probabilities into structured queries.

5.4 The CLEF Collection 

This section reports on some of the statistics of the CLEF collection and
compares it to the TREC cross-language collection. Table 5 lists the size,
number of judged documents, number of relevant documents and the judged
fraction, which is the part of the collection that is judged per topic.



Table 5. CLEF collection statistics, 40 topics (1-40)

collection  total docs.  judged docs.  relevant docs.  no hits in topic         judged fraction
english     110,250      14,737        579             2, 6, 8, 23, 25, 27, 35  0.0033
french      44,013       8,434         528             2, 4, 14, 27, 28, 36     0.0048
german      153,694      12,283        821             2, 28, 36                0.0020
italian     58,051       8,112         338             3, 6, 14, 27, 28, 40     0.0035
total       366,008      43,566        2,266                                    0.0022



Table 6 lists the same information for the TREC collection. The collections
are actually quite different. First of all, the CLEF collection is almost half
the size of the TREC collection and heavily biased towards German and English
documents. Although the CLEF organisation decided to judge only the top 50
documents retrieved, and not the top 100 as in TREC, the number of documents
judged per topic is only a little lower for the CLEF collection: about 814
documents per topic vs. 834 for TREC. Given that the 56 TREC topics were
developed over a period of two years and the CLEF collection already has 40
topics, the organisation actually did more work this year than in previous
years. Another striking difference is the number of relevant documents per
topic: only 57 for CLEF and 116 for TREC. This might actually make the decision
to judge only the top 50 of runs not that harmful for the usefulness of the
CLEF evaluation results.

Table 6. TREC collection statistics, 56 topics (26-81)

collection  total docs.  judged docs.  relevant docs.  no hits in topic            judged fraction
english     242,866      18,783        2,645           26, 46, 59, 63, 66, 75      0.0014
french      141,637      11,881        1,569           76                          0.0015
german      185,099      8,656         1,634           26, 60, 75, 76              0.0008
italian     62,359       7,396         671             26, 44, 51, 60, 63, 75, 80  0.0021
total       631,961      46,716        6,519                                       0.0013



6 Conclusions 

This year’s evaluation has confirmed that cross-language retrieval based on
structured queries, no matter what the translation resources are, is a powerful
technique. Re-estimating model parameters based on pseudo relevant documents
does not improve retrieval performance. However, the relevance weighting
algorithm shows an impressive performance gain if the relevant documents are
used retrospectively. This indicates that the algorithm might in fact be a
valuable tool for processing user feedback in an interactive setting. Finally,
merging based on the collection size re-estimation technique did not prove
successful. Further analysis is needed to find out why the technique did not
work on this collection, as it was quite successful on the TREC-8 collection.
Abstract. For our participation in CLEF, the Berkeley group participated in
the monolingual, multilingual and GIRT tasks. To help enrich the CLEF
relevance set for future training, we prepared a manual reformulation of the
original German queries which achieved excellent performance, more than 110%
better than the average of median precision. The GIRT task performed
English-German Cross-Language IR by comparing commercial machine translation
with thesaurus lookup techniques and query expansion techniques. Combining all
techniques using simple data fusion produced the best results.



1 Introduction 

Unlike monolingual retrieval, where the queries and documents are in the same
language and where mechanistic techniques can be applied, cross-language
information retrieval (CLIR) must combine linguistic techniques (phrase
discovery, machine translation, bilingual dictionary lookup) with robust
monolingual information retrieval. The Berkeley Text Retrieval Research group
has been using the technique of logistic regression from the beginning of the
TREC series of conferences. Indeed our primary development has been a result of
the U.S. TREC conferences and collections, which provided the first large-scale
test collection for modern information retrieval experimentation. In TREC-2 we
derived a statistical formula for predicting probability of relevance based upon
statistical clues contained within documents, queries and collections as a
whole. This formula was used for document retrieval in Chinese [3] and Spanish
in TREC-4 through TREC-6. We utilized the identical formula for English queries
against German documents in the cross-language track for TREC-6. In TREC-7 the
formula was also used for cross-language runs over multiple European languages.
During the past year the formula has proven well-suited for Japanese and
Japanese-English cross-language information retrieval, even when only trained
on English document collections. Our participation in the NTCIR Workshop in
Tokyo
(http://www.rd.nacsis.ac.jp/htcadm/workshop/work-en.html)
led to different techniques for cross-language retrieval, ones which utilized the

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 116-E2EI 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
power of human indexing of documents to improve retrieval via bilingual
lexicon development and a form of text categorization which associated terms in
documents with humanly assigned index terms. These techniques were applied
to English-German retrieval for the GIRT-1 task and collection in the TREC-8
conference.



2 Logistic Regression for Document Ranking 

The document ranking formula used by Berkeley in all of our CLEF retrieval
runs was the TREC-2 formula. The ad hoc retrieval results on the TREC test
collections have shown that the formula is robust for long queries and manually
reformulated queries. Applying the same formula (trained on English TREC
collections) to other languages has performed well, as on the TREC-4 Spanish
collections, the TREC-5 Chinese collection and the TREC-6 and TREC-7 European
languages (French, German, Italian). Thus the algorithm has demonstrated its
robustness independent of language as long as appropriate word boundary
detection (segmentation) can be achieved. The log odds of relevance of
document D to query Q is given by

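$$\log O(R \mid D, Q) = \log \frac{P(R \mid D, Q)}{P(\bar{R} \mid D, Q)} \qquad (1)$$

$$= -3.51 + \frac{1}{\sqrt{N+1}}\,\Phi + 0.0929\,N \qquad (2)$$

$$\Phi = 37.4 \sum_{i=1}^{N} \frac{qtf_i}{ql + 35} + 0.330 \sum_{i=1}^{N} \log \frac{dtf_i}{dl + 80} - 0.1937 \sum_{i=1}^{N} \log \frac{ctf_i}{cl} \qquad (3)$$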



where $P(R \mid D, Q)$ is the probability of relevance of document D with
respect to query Q, and $P(\bar{R} \mid D, Q)$ is the probability of irrelevance
of document D with respect to query Q. Details about the derivation of these
formulae may be found in our NTCIR workshop paper. It is to be emphasized that
training has taken place exclusively on English documents but the matching has
proven robust over seven other languages in monolingual retrieval, including
Japanese and Chinese, where word boundaries form an additional step in the
discovery process.
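
The ranking formula can be written out as a short function. This is a sketch rather than the Berkeley implementation. The symbols are not defined in this section, so their reading here follows the usual statement of the TREC-2 formula and should be taken as an assumption: N is the number of terms shared by query and document, qtf, dtf and ctf are the within-query, within-document and collection frequencies of a shared term, and ql, dl and cl are the query, document and collection lengths.

```python
import math


def log_odds(matches, ql, dl, cl):
    """Log odds of relevance according to the TREC-2 formula.

    matches: one (qtf, dtf, ctf) triple per term shared by query and document.
    ql, dl, cl: query length, document length and collection length.
    """
    n = len(matches)
    phi = (37.4 * sum(qtf / (ql + 35) for qtf, _, _ in matches)
           + 0.330 * sum(math.log(dtf / (dl + 80)) for _, dtf, _ in matches)
           - 0.1937 * sum(math.log(ctf / cl) for _, _, ctf in matches))
    return -3.51 + phi / math.sqrt(n + 1) + 0.0929 * n


def probability_of_relevance(value):
    """Convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-value))


# The same query and document, matching on a rare or on a common term.
rare = log_odds([(1, 2, 10)], ql=3, dl=100, cl=10**6)
common = log_odds([(1, 2, 10000)], ql=3, dl=100, cl=10**6)
no_match = log_odds([], ql=3, dl=100, cl=10**6)
```

Documents are ranked by the log odds; a match on a rare term contributes more than a match on a common one, and a document without any matching term receives the constant -3.51.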



3 Submissions for the CLEF Main Tasks 

For CLEF we submitted 8 runs, 4 for the Monolingual (non-English) task and
4 for the Multilingual task. The following sections give a description for
each run.
For the Monolingual task we submitted:

Run Name  Language  Run type   Priority
BKMOGGM1  German    Manual     1
BKMOFFA2  French    Automatic  2
BKMOGGA1  German    Automatic  3
BKMOIIA3  Italian   Automatic  4

For the Multilingual task we submitted:

Run Name  Language  Run type   Priority
BKMUEAA1  English   Automatic  1
BKMUGAM1  German    Manual     2
BKMUEAA2  English   Automatic  3
BKMUGAA3  German    Automatic  4

Table 1. Summary of eight official CLEF runs.



3.1 Monolingual Retrieval of the CLEF Collections 

1. BKMOIIA3 (Berkeley Monolingual Italian against Italian Automatic Run 3)
The original query topics in Italian were searched against the Italian
collection (La Stampa). For indexing this collection, we used a stopword list,
the Italian-to-lower normalizer and the Italian stemmer (from association
dictionary) described in Section 4.

2. BKMOFFA2 (Berkeley Monolingual French against French Automatic Run 2)
The original query topics in French were searched against the French
collection (Le Monde). For indexing this collection, we used a stopword list,
the French-to-lower normalizer and the French stemmer (from association
dictionary) described in Section 4.

3. BKMOGGA1 (Berkeley Monolingual German against German Automatic Run 1)
The original query topics in German were searched against the German
collection (Frankfurter Rundschau and Der Spiegel). For indexing the
collection, we used a stopword list that also contained capitalized versions of
words and the German stemmer (from association dictionary) described in
Section 3.4. We did not use a normalizer for this collection because all nouns
in German are capitalized and hence this clue might be used in retrieval.

4. BKMOGGM1 (Berkeley Monolingual German against German Manual Run 1)
The original query topics in German were extended with additional query terms
obtained by searching the German CLEF collection (Frankfurter Rundschau and Der
Spiegel) with the original German query topics and looking at the results for
these original queries (with the help of Aitao Chen’s Cross-language Text
Retrieval System Web-interface). The additional query terms were obtained by
either directly looking at the documents or looking at the top ranked document
terms for the original query text. The searcher spent about 10 to 25 minutes
per topic or query depending on familiarity with the context and meaningfulness
of the returned documents and top ranked document terms. For indexing the
collection, we used a stopword list that also contained capitalized versions of
words and the German stemmer (from association dictionary) built by Aitao Chen.
We did not use a normalizer for this collection.



3.2 Monolingual Performance 

Our monolingual performance can be found in Table 2. While the average of
medians cannot be considered a meaningful statistic from which inference can be
made, we have found it useful to average the medians of all queries as sent by
the CLEF organizers. Comparing our overall precision to this average of medians
gives us a rough gauge of whether our performance is better than, poorer than,
or about the same as the median performance. Thus the bottom two rows of the
table present the Berkeley overall precision over all queries for which
performance has been judged and, below it, the average of the median precision
for each query over all submitted runs. From this we see that Berkeley’s
automatic runs are about the same as the overall ‘average’, while Berkeley’s
German-German manual run comes in at an overall precision 57 percent better
than the average of median precisions for German-German monolingual runs. As we
shall see in the next section, an improved German query set had an even greater
impact on multilingual retrieval.

Run ID      BKMOIIA3  BKMOFFA2  BKMOGGA1  BKMOGGM1
Retrieved   34000     34000     37000     37000
Relevant    338       528       821       821
Rel. Ret.   315       508       701       785
Precision
  at 0.00   0.7950    0.7167    0.6342    0.6907
  at 0.10   0.7617    0.6824    0.5633    0.6584
  at 0.20   0.6601    0.5947    0.5173    0.6442
  at 0.30   0.6032    0.5195    0.3999    0.6037
  at 0.40   0.5756    0.4825    0.3687    0.5624
  at 0.50   0.5336    0.4404    0.3181    0.5428
  at 0.60   0.4189    0.3627    0.2731    0.4970
  at 0.70   0.3098    0.2960    0.2033    0.4580
  at 0.80   0.2417    0.2422    0.1704    0.4006
  at 0.90   0.1816    0.1936    0.1364    0.2959
  at 1.00   0.1533    0.1548    0.0810    0.2059
Brk. Prec.  0.4601    0.4085    0.3215    0.4968
Med. Prec.  0.4453    0.4359    0.3161    0.3161

Table 2. Results of four official CLEF monolingual runs.

Another observation concerns the skew in relevance. More than twice as many
relevant documents come from the German collection as from the Italian
collection. Thus a better German query set may have more impact on
multilingual retrieval than a better Italian query set.

3.3 Multilingual Retrieval of the CLEF Collections

Several interesting questions have arisen in recent research on CLIR. First, is
CLIR merely a marriage of convenience between machine translation and ordinary
(monolingual) information retrieval? In our CLEF work we made use of two widely
available machine translation packages, the SYSTRAN system found at the
AltaVista site, and the Lernout and Hauspie Power Translator Pro Version 7.0.
For the GIRT retrieval we made comparisons to Power Translator. For CLEF
multilingual we combined translations and dictionary lookup from multiple
sources, having found that different packages made different mistakes on
particular topics. Second, what is the role of language specific stemming in
improved performance? Our experience with the Spanish tracks of TREC has
convinced us that some form of stemming will always improve performance. For
this particular evaluation we chose to create a stemmer mechanistically from
common leading substring analysis of the entire corpus. The impact of the
stemmer on performance will be discussed at the end of the official results
discussion. Third, is performance improved by creating a multilingual index by
pooling all documents together in one index, or by creating separate language
indexes and doing monolingual retrieval for each language followed by data
fusion which combines the individual rankings into a unified ranking
independent of language? This was one of the major focuses of our experiments
at CLEF.

1. BKMUEAA1 (Berkeley Multilingual English against all Automatic Run 1)
The original query topics in English were translated once with the Systran
system (http://babel.altavista.com/translate.dyn) and once with L&H
Powertranslator. The English topics were translated into French, German, and
Italian. The two translated files for each language were pooled and then put
together in one query file (the English original query topics were multiplied
by 2 to gain the same frequency of query terms in the query file). The final
topics file contained 2 English (original), French, German, and Italian
versions (one Powertranslator and one Systran) for each topic. During the
search, we divided the frequency of the search terms by 2 to avoid
over-emphasis of equally translated search terms. The collection consisted of
all languages. For indexing the English part of this collection, we used a
stopword list, the default normalizer and the Porter stemmer. For indexing the
French part of this collection, we used a stopword list, the French-to-lower
normalizer and the French stemmer (from association dictionary in section 3.5).
For indexing the German part of the collection, we used a stopword list that
also contained capitalized versions of words and the German stemmer (from
association dictionary) built by Aitao Chen. We did not use a normalizer for
this collection. For indexing the Italian part of this collection, we used a
stopword list, the Italian-to-lower normalizer and the Italian stemmer (from
association dictionary).
Run ID        BKMUEAA1  BKMUEAA2  BKMUGAA2  BKMUGAM1
Retrieved        40000     40000     40000     40000
Relevant          2266      2266      2266      2266
Rel. Ret.         1434      1464      1607      1838
Precision
  at 0.00       0.7360    0.7460    0.7238    0.7971
  at 0.10       0.5181    0.5331    0.5046    0.6534
  at 0.20       0.4287    0.4465    0.4229    0.5777
  at 0.30       0.3545    0.3762    0.3565    0.5032
  at 0.40       0.2859    0.2929    0.3027    0.4373
  at 0.50       0.2183    0.2290    0.2523    0.3953
  at 0.60       0.1699    0.1846    0.1990    0.3478
  at 0.70       0.1231    0.1454    0.1682    0.3080
  at 0.80       0.1020    0.0934    0.1295    0.2238
  at 0.90       0.0490    0.0480    0.0622    0.1530
  at 1.00       0.0136    0.0081    0.0138    0.0474
Brk. Prec.      0.2502    0.2626    0.2654    0.3903
Med. Prec.      0.1843    0.1843    0.1843    0.1843

Table 3. Results of four official CLEF multilingual runs.



2. BKMUEAA2 (Berkeley Multilingual English against all Automatic Run 2)

The original query topics in English were translated once with Systran and
once with L&H Power Translator. The English topics were translated into
French, German, and Italian. The two translated versions for each language
were pooled together in one query file (resulting in three topics files: one
in German with the Systran and Power Translator versions, one in French with
the Systran and Power Translator versions, and one in Italian accordingly).
The original English topics file was searched against the English collection
(Los Angeles Times). The pooled German topics file was searched against the
German collection, the pooled French topics file was searched against the
French collection, and the pooled Italian topics file was searched against the
Italian collection. The frequency of the search terms was divided by 2 to
avoid over-emphasis of search terms that both systems translated identically.
This resulted in four result files with the 1000 top-ranked records for each
topic. These four result files were then pooled together and sorted by weight
(rank) for each record and topic. The pooling method is described below. For a
description of the collections see BKMOGGM1, BKMOFFA2, BKMOIIA3, BKMUEAA1.

3. BKMUGAA2 (Berkeley Multilingual German against all Automatic Run 2)

The original query topics in German were translated once with Systran and
once with Power Translator. The German topics were translated into English,
French, and Italian. The two translated versions for each language were pooled
together in one query file. The original German topics file was duplicated so
that its query terms had the same frequency as the translated terms in the
query file searched. The final topics file contained two versions of each
topic in German (the original, twice), English, French, and Italian (one Power
Translator and one Systran version each). During the search, we divided the
frequency of the search terms by 2 to avoid over-emphasis of search terms that
both systems translated identically. For a description of the collection see
BKMUEAA1.

4. BKMUGAM1 (Berkeley Multilingual German against all Manual Run 1)

The manually extended German query topics (see the description of BKMOGGM1)
were translated with Power Translator into English, French and Italian. These
translations were pooled together with the German originals in one file. This
topics file was searched against the whole collection including all four
languages. For a description of the collection see BKMUEAA1.
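
The query pooling used in the automatic runs above can be illustrated with a
small sketch. It is not the Berkeley code: the function name pooled_query, the
whitespace tokenization, and the toy translations are assumptions made for
illustration. It shows why duplicating the original topic and then halving all
term frequencies gives a term on which both translation systems agree the same
weight as an original-language term, and half that weight when the systems
disagree.

```python
from collections import Counter


def pooled_query(original, translations):
    """Pool an original topic with two machine translations per language.

    original:     topic text in the source language
    translations: dict mapping a language code to a list of translated
                  versions (here: one per translation system)
    Returns a dict mapping each term to its halved pooled frequency.
    """
    counts = Counter()
    # The original topic is counted twice so that it matches the two
    # translated versions available for every other language.
    for _ in range(2):
        counts.update(original.lower().split())
    for versions in translations.values():
        for text in versions:
            counts.update(text.lower().split())
    # Halving keeps identically translated terms from being over-emphasized.
    return {term: count / 2 for term, count in counts.items()}


# Hypothetical translations, for illustration only.
weights = pooled_query(
    "drug policy",
    {"de": ["drogen politik", "drogen politik"],
     "fr": ["drogue politique", "politique drogue"]},
)
```

In this toy example every term ends up with weight 1.0 because the two
translations of each language use the same words; a term produced by only one
of the two systems would receive weight 0.5.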



3.4 Berkeley’s CLEF Multilingual Performance 

Our multilingual performance can be found in Table 3.

In contrast to the average of medians for the monolingual runs, the values in
the last row of the table are the same for all columns. Our automatic runs
performed almost identically, at about 38 percent better than the average of
medians, while the run BKMUGAM1, at overall precision 0.39, is 112 percent
better than the average of multilingual query medians.



3.5 Building a Simple Stemmer for Cross-Language Information 
Retrieval 

A stemmer for the French collection was created by first translating all the 
distinct French words found in the French collection into English using SYS- 
TRAN. The English translations were normalized by reducing verbs to the base 
form, nouns to the singular form, and adjectives to the positive form. All the 
French words which have the same English translations after normalization were 
grouped together to form a class. A member from each class is selected to rep- 
resent the whole class in indexing. All the words in the same class were replaced 
by the class representative in indexing. 

The German stemmer and Italian stemmer were created similarly. 
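
The class-building step described above can be sketched as follows. This is a
minimal illustration, not the stemmer that was actually built: it assumes the
normalized English translation of every source-language word is already
available in a dictionary, and it picks the shortest member (ties broken
alphabetically) as class representative, whereas the text only says that "a
member" is selected. The names build_stem_classes and stem are hypothetical.

```python
from collections import defaultdict


def build_stem_classes(translations):
    """Group source-language words that share a normalized English translation.

    translations: dict mapping a source word to its normalized English
                  translation (base verb, singular noun, positive adjective)
    Returns a dict mapping every source word to its class representative.
    """
    classes = defaultdict(list)
    for word, english in translations.items():
        classes[english].append(word)

    stem_of = {}
    for members in classes.values():
        # Assumption: use the shortest member as the representative.
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            stem_of[word] = representative
    return stem_of


def stem(word, stem_of):
    """Replace a word by its class representative; unknown words pass through."""
    return stem_of.get(word, word)


# Toy French vocabulary with assumed translations.
stem_of = build_stem_classes({
    "politique": "policy",
    "politiques": "policy",
    "drogue": "drug",
    "drogues": "drug",
})
```

At indexing time every token would be passed through stem, so that
"politiques" and "politique" are indexed under the same term.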

We submitted four monolingual runs and four multilingual runs. These eight
runs were repeated without the French, German, and Italian stemmers. The
overall precision for each of the eight runs without stemming is shown in
column 3 of Table 4. Column 4 shows the overall precision with the French,
German, and Italian stemmers. Column 5 shows the improvement in precision
which can be attributed to the stemmers.

The overall precision for pooling queries and without stemming (the method 
we applied two years ago) for the multilingual run using English queries was 
.2335. With stemming and pooling documents, the overall precision for the same 
run was .2626, which is 12.46 percent better. This can be considered as additional 
evidence that adding a stemming capability will result in an improvement in 
automatic multilingual retrieval. 
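
The relative improvements quoted here and in the stemming table are plain
percentage changes over the unstemmed baseline. A small helper (the name
percent_change is ours) makes the arithmetic explicit:

```python
def percent_change(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Pooled queries without stemming (.2335) versus stemming with pooled
# documents (.2626): about 12.46 percent.
gain = percent_change(0.2335, 0.2626)
```

The published figures appear to be truncated rather than rounded to two
decimals, so recomputed values can differ from them in the last digit.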
RUN ID      TASK           RESULTS        OFFICIAL RESULTS   Change
                           (unstemmed)    (stemmed)          (pct)
BKMUEAA1    multilingual   0.2335         0.2502              7.15
BKMUEAA2    multilingual   0.2464         0.2626              6.57
BKMUGAA3    multilingual   0.2524         0.2654              5.15
BKMUGAM1    multilingual   0.3749         0.3903              4.10
BKMOFFA2    monolingual    0.3827         0.4085              6.74
BKMOGGA1    monolingual    0.3113         0.3215              3.27
BKMOGGM1    monolingual    0.4481         0.4968             10.86
BKMOIIA3    monolingual    0.4054         0.4601             13.49

Table 4. Results of Stemming Experiments



3.6 Data Fusion or Monolingual Document Pooling 

The second idea centers on pooling documents from monolingual retrieval runs.
The naive solution would be to simply combine the retrieval results from the
four monolingual retrieval runs and sort the combined results by the estimated
probability of relevance. The problem with this simple combination is that
when the estimated probability of relevance is biased toward one document
collection (as the above statistics show for German), the documents from that
collection will always appear at the top of the combined list of ranked
documents. For our final run, we took a more conservative approach by making
sure that the top 50 documents from each of the four monolingual lists of
documents appear in the top 200 of the combined list of documents.
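
One way to realize this guarantee is sketched below. The exact merging
procedure is not spelled out in the text, so the details are assumptions: the
top `quota` documents of every monolingual list are placed in the head of the
merged list, ordered among themselves by estimated probability, and all
remaining documents follow in score order. The function name
merge_with_quota is hypothetical.

```python
from operator import itemgetter


def merge_with_quota(ranked_lists, quota=50):
    """Merge monolingual rankings so each language is represented at the top.

    ranked_lists: dict mapping a language to a list of (doc_id, score) pairs
                  already sorted by decreasing score
    quota:        number of documents per language guaranteed a place in the
                  first len(ranked_lists) * quota positions
    """
    head, tail = [], []
    for results in ranked_lists.values():
        head.extend(results[:quota])
        tail.extend(results[quota:])
    # Inside each block documents are still ordered by estimated probability.
    head.sort(key=itemgetter(1), reverse=True)
    tail.sort(key=itemgetter(1), reverse=True)
    return head + tail
```

With four languages and quota=50, the head block holds at most 200 documents,
so the top 50 of every monolingual run land in the top 200 even when one
collection's scores are systematically higher.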

3.7 Failure Analysis 

A query-by-query analysis can be done to identify problems. We have not had
time to do this, but one query stands out. Query 40, about the privatization
of the German national railway, seems to have given everyone problems (the
median precision over all CLEF runs was 0.0537 for this query). As an American
group, we were particularly vexed by the use of the English spelling
'privatisation', which could not be recognized by either of our machine
translation packages. The German version of the topic was not much better: in
translation its English equivalent became 'de-nationalization', a very
uncommon synonym for 'privatization' and one which yielded few relevant
documents. By comparison, our German manual reformulation of this query
resulted in an average precision of 0.3749, the best CLEF performance for this
query.

4 GIRT Retrieval 

A special emphasis of our current funding has been the retrieval of
specialized domain documents which have been assigned individual
classification identifiers by human indexers. These classification identifiers
come from what we call "domain ontologies", of which thesauri are a particular
case. Since many millions of dollars are expended on developing these
classification ontologies and applying them to index documents, it seems only
natural to attempt to exploit the resources previously expended to the fullest
extent possible to improve retrieval. In some cases such thesauri are
developed with identifiers translated (or provided) in multiple languages.
This has been done in Europe with the GEMET (General European Multilingual
Environmental Thesaurus) effort and with the OECD General Thesaurus (available
in English, French, and Spanish). A review of multilingual thesauri can be
found in .

The GIRT collection consists of reports and papers (grey literature) in the
social science domain. The collection is managed and indexed by the GESIS
organization (http://www.social-science-gesis.de). GIRT is an excellent
example of a collection indexed by a multilingual thesaurus, originally
German-English and recently translated into Russian. We worked extensively
with a previous version of the GIRT collection in our cross-language work for
TREC-8 [5].

4.1 The GIRT Collection 

There are 76128 German documents in the GIRT subtask collection. Of them,
about 54275 (72 percent) have English TITLE sections. 5317 documents (7
percent) also have English TEXT sections. Almost all the documents contain
manually assigned thesaurus terms. On average, about 10 thesaurus terms are
assigned to each document.

In our experiments, we indexed only the TITLE and TEXT sections in each
document (not the E-TITLE or E-TEXT). The CLEF rules specified that indexing
any other field would need to be declared a manual run. For our CLEF runs this
year we added a German stemmer similar to the Porter stemmer for the German
language. Using this stemmer led to a 15 percent increase in average precision
when tested using the GIRT-1 collection of TREC-8.

4.2 Query Translation 

In CLIR, essentially either queries or documents or both need to be translated
from one language to another. Query translation is almost always selected for
practical reasons of efficiency, and because translation errors in documents
can propagate without discovery since the maintainers of a text archive rarely
read every document.

For the CLEF GIRT task, our focus has been to compare the performance of
different translation strategies. We applied the following three methods to
translate the English queries to German: thesaurus lookup, Entry Vocabulary
Module (EVM), and machine translation (MT). The resulting German queries were
run against the GIRT collection.



Thesaurus Lookup. The GIRT social science thesaurus is a German-English
bilingual thesaurus. Each German item in this thesaurus has a corresponding
English translation. We took the following steps to translate the English
query to German by thesaurus lookup:

a. Create an English-German transfer dictionary from the Social Science
Thesaurus. This transfer dictionary contains English items and their
corresponding German translations. This "vocabulary discovery" approach was
taken by Eichmann, Ruiz and Srinivasan for medical information cross-language
retrieval using the UMLS Metathesaurus.

b. Use the part-of-speech tagger LT-POS, developed by the University of
Edinburgh (http://www.ltg.ed.ac.uk/software/pos/index.html), to tag the
English query and identify the noun phrases in it. One problem with thesaurus
lookup is how to match the phrasal items in a thesaurus. We have taken a
simple approach to this problem: use the POS tagger to identify noun phrases.

For last year's GIRT task at the TREC-8 evaluation, we extracted an
English-German transfer dictionary from the GIRT thesaurus and used it to
translate the English queries to German. This approach left about 50 percent
of the English query words untranslated. After examining the untranslated
English query words carefully, we found that most of them fell into two
categories. The first category contains general terms that are not likely to
occur in a domain-specific thesaurus like GIRT. Examples are "country", "car",
"foreign", "industry", "public", etc. The second category contains terms that
occur in the thesaurus but in a different format from the original English
query words. For example, "bosnia-herzegovina" does not appear in the
thesaurus, but "bosnia and herzegovina" does.



Fuzzy Matching for the Thesaurus. To deal with the general terms in the first
category, a general-purpose dictionary was applied after thesaurus lookup. A
fuzzy-matching strategy was used to address the problem of the second
category. It counts the letter pairs that two strings have in common and uses
Dice's coefficient as a means of assessing the similarity between the two
strings. This fuzzy-matching strategy successfully recovered some query terms,
for example:



original query terms          thesaurus terms

asylum policy                 policy on asylum
anti-semitism                 antisemitism
bosnia-herzegovina            bosnia and herzegovina
gypsy                         gipsy
German Democratic Republic    German Democratic Republic (gdr)


Fuzzy matching also found related terms for some query terms which do not 
appear in the thesaurus at all, for example see the following table. 
original query terms             thesaurus terms

nature protection legislation    nature protection
violent act                      violence
bosnia                           bosnian


We tested this combined approach using last year's GIRT-1 data. The results
showed an increase of about 18 percent in average precision compared with
simple thesaurus lookup.
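
The letter-pair matching described above can be sketched in a few lines. The
sketch counts overlapping character bigrams (with multiplicity) and applies
Dice's coefficient, 2 * |common| / (|pairs in a| + |pairs in b|). How the
original implementation handled case, spaces and hyphens is not stated, so
lower-casing and treating every character as part of a pair are assumptions,
as are the names dice_similarity and best_match.

```python
from collections import Counter


def letter_pairs(text):
    """Multiset of adjacent character pairs in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a, b):
    """Dice's coefficient over the letter pairs shared by two strings."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    common = sum((pairs_a & pairs_b).values())
    return 2.0 * common / total


def best_match(query_term, thesaurus_terms):
    """Thesaurus entry most similar to the query term."""
    return max(thesaurus_terms, key=lambda t: dice_similarity(query_term, t))


match = best_match(
    "bosnia-herzegovina",
    ["policy on asylum", "bosnia and herzegovina", "gipsy"],
)
# match is "bosnia and herzegovina"
```

In practice a similarity threshold would be needed so that query terms with no
reasonable counterpart in the thesaurus are left to the general-purpose
dictionary instead of being mapped to a poor match.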



Entry Vocabulary Module (EVM). In the GIRT collection, about 72 percent of the
documents have both German titles and English titles. 7 percent also have
English text sections. This feature allows us to build an EVM which maps the
English words appearing in the English title and text sections to German
thesaurus terms. This mapping can then be used to translate the English
queries. More details about this work can be found in [5].

Machine Translation (MT) For comparison, we also applied the Lernout and 
Hauspie Power Translator product to translate the English queries into German. 



Merging Results. While our CLEF multilingual strategy focussed on merging
monolingual results run independently on different subcollections, one per
language, all our GIRT runs were done on a single subcollection, the German
text part of GIRT. When analyzing the experimental training results, we
noticed that different translation methods retrieved sets of documents that
contain different relevant documents. This implies that merging the results
from different translation methods may lead to better performance than any one
of the methods alone. Since we use the same retrieval algorithm and data
collection for all the runs, the probabilities that a document is relevant to
a query obtained from different runs are commensurable. So, for each document
retrieved, we used the sum of its probabilities from the different runs as its
final score to create the ranking for the merged results.
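
The merging rule just described amounts to summing, per document, the
relevance probabilities estimated by the individual runs. A minimal sketch for
a single query follows; the data layout (one dict per run) and the function
name fuse_by_probability_sum are assumptions.

```python
from collections import defaultdict


def fuse_by_probability_sum(runs):
    """Merge runs for one query by summing each document's probabilities.

    runs: list of dicts, each mapping doc_id to the estimated probability of
          relevance produced by one translation method
    Returns (doc_id, summed score) pairs sorted by decreasing score.
    """
    totals = defaultdict(float)
    for run in runs:
        for doc_id, probability in run.items():
            totals[doc_id] += probability
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = fuse_by_probability_sum([
    {"a": 0.5, "b": 0.25},      # e.g. thesaurus lookup
    {"b": 0.5, "c": 0.125},     # e.g. entry vocabulary module
    {"a": 0.125},               # e.g. machine translation
])
# Document order: b (0.75), a (0.625), c (0.125)
```

A document retrieved by several methods accumulates evidence from each of
them, which is why the fused run can outrank every individual run.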

4.3 Results and Analysis 

Our GIRT results are summarized in Table 5. The runs can be described as
follows. BKGREGA4 used our entry vocabulary method to map from query term to
thesaurus term; the top-ranked thesaurus term and its translation were used to
create the German query. BKGREGA3 used the results of machine translation by
the L&H Power Translator software. The run BKGREGA2 used thesaurus lookup of
English terms in the query and a general-purpose English-German dictionary for
terms not found, as well as the fuzzy matching strategy described above. The
final run BKGREGA1 merged the results from the other three runs according to
the sum of probabilities of relevance. Note that it performs significantly
better than the other three runs, and about 61 percent better than the average
of median precisions for the CLEF GIRT task. One reason is that different
individual runs performed much better on different queries. The three
individual methods achieved best precision in eight of the 25 queries and the
fusion run achieved best precision for another 4 queries.



Run ID        BKGREGA1  BKGREGA2  BKGREGA3  BKGREGA4
Retrieved        23000     23000     23000     23000
Relevant          1193      1193      1193      1193
Rel. Ret.          901       772       563       827
Precision
  at 0.00       0.7013    0.5459    0.6039    0.6139
  at 0.10       0.5610    0.4436    0.3662    0.4482
  at 0.20       0.4585    0.4172    0.2881    0.3583
  at 0.30       0.4203    0.3576    0.2633    0.3292
  at 0.40       0.3774    0.3165    0.2486    0.2465
  at 0.50       0.3454    0.2856    0.2266    0.2004
  at 0.60       0.2938    0.2548    0.1841    0.1611
  at 0.70       0.2025    0.1816    0.1107    0.1477
  at 0.80       0.1493    0.1439    0.0663    0.1252
  at 0.90       0.0836    0.0829    0.0575    0.0612
  at 1.00       0.0046    0.0075    0.0078    0.0003
Brk. Prec.      0.3119    0.2657    0.2035    0.2299
Med. Prec.      0.1938    0.1938    0.1938    0.1938

Table 5. Results of four official GIRT English-German runs.



5 Summary and Acknowledgments 

Berkeley's participation in CLEF has enabled us to explore refinements in
cross-language information retrieval. Specifically, we have explored two data
fusion methods. For the CLEF multilingual task we developed a technique for
merging monolingual, language-specific rankings which ensures representation
from each constituent language. For the GIRT English-German task, we obtained
improved retrieval by fusing the results of multiple methods of mapping from
English queries to German. A new stemming method was developed which maps
classes of words to a representative word in both English and the targeted
languages of French, German, and Italian. For future research we are creating
a Russian version of the GIRT queries to test strategies for Russian-German
retrieval via a multilingual thesaurus.
Translingual Information Detection, Extraction, and Summarization Program
Abstract. We present an approach to multilingual information retrieval that
does not depend on the existence of specific linguistic resources such as 
stemmers or thesauri. Using the HAIRCUT system we participated in the 
monolingual, bilingual, and multilingual tasks of the CLEF-2000 evaluation. 
Our approach, based on combining the benefits of words and character n-grams, 
was effective for both language-independent monolingual retrieval as well as 
for cross-language retrieval using translated queries. After describing our 
monolingual retrieval approach we compare a translation method using aligned 
parallel corpora to commercial machine translation software. 



1 Background 

The Hopkins Automated Information Retriever for Combing Unstructured Text 
(HAIRCUT) is a research retrieval system developed at the Johns Hopkins University 
Applied Physics Lab (APL). One of the research areas that we want to investigate 
with HAIRCUT is the relative merit of different tokenization schemes. In particular 
we use both character n-grams and words as indexing terms. Our experiences in the 
TREC evaluations have led us to believe that while n-grams and words are 
comparable in retrieval performance, a combination of both techniques outperforms 
the use of a single approach. Through the CLEF-2000 evaluation we demonstrate
that unsophisticated, language-independent techniques can form a credible approach 
to multilingual retrieval. We also compare query translation methods based on 
parallel corpora with automated machine translation. 



2 Overview 

We participated in the monolingual, bilingual, and multilingual tasks. For all three 
tasks we used the same eight indices, a word and an n-gram (n=6) based index in each 
of the four languages. Information about each index is provided in Table 1. In all of 
our experiments documents were indexed in their native language because we prefer
query translation over document translation for reasons of efficiency. 

Table 1. Index statistics for the CLEF collection

           # docs    collection size   name   # terms     index size
                     (MB gzipped)                          (MB)
English    110,282   163               enw      219,880     255
                                       en6    2,668,949    2102
French      44,013    62               frw      235,662      96
                                       fr6    1,765,656     769
German     153,694   153               gew    1,035,084     295
                                       ge6    3,440,316    2279
Italian     58,051    78               itw      278,631     130
                                       it6    1,650,037    1007



We used two methods of translation in the bilingual and multilingual tasks. We 
used the Systran® translator to convert French and Spanish queries to English for our 
bilingual experiments and to convert English topics to French, German and Italian in 
the multilingual task. For the bilingual task we also used a method based on extracting 
translation equivalents from parallel corpora. Parallel English/French documents 
were most readily available to us, so we only applied this method when translating 
French to English. 



2.1 Index Construction 

Documents were processed using only the permitted tags specified in the
workshop guidelines. First, SGML macros were expanded to their appropriate
character in the ISO-8859-1 character set. Then punctuation was eliminated,
letters were downcased, and only the first two of a sequence of digits were
preserved (e.g., 1920 became 19##). Diacritical marks were preserved. The
result is a stream of blank-separated words. When using n-grams we construct
indexing terms from the same stream of words; the n-grams may span word
boundaries, but sentence boundaries are noted so that n-grams spanning
sentence boundaries are not recorded. Thus n-grams with leading, central, or
trailing spaces are formed at word boundaries. We used a combination of
unstemmed words and 6-grams with success in the TREC-8 CLIR task and decided
to follow the same strategy this year. As can be seen from Table 1, the use of
6-grams as indexing terms increases both the size of the inverted file and the
dictionary.
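
The tokenization steps above can be sketched as follows. This is an
illustration rather than HAIRCUT's Java code: sentence boundaries are
approximated by splitting on '.', '!' and '?', no padding spaces are added at
the start or end of a sentence, and the function names normalize and
char_ngrams are hypothetical.

```python
import re


def normalize(sentence):
    """Lower-case, mask digits after the first two, and drop punctuation."""
    text = sentence.lower()
    # Keep only the first two digits of any digit sequence: 1920 -> 19##.
    text = re.sub(
        r"\d+",
        lambda m: m.group()[:2] + "#" * (len(m.group()) - 2),
        text,
    )
    # Replace punctuation by blanks; letters with diacritics are kept.
    text = re.sub(r"[^\w#\s]|_", " ", text)
    return text.split()


def char_ngrams(text, n=6):
    """Character n-grams over the word stream, never crossing a sentence."""
    grams = []
    for sentence in re.split(r"[.!?]+", text):
        stream = " ".join(normalize(sentence))
        grams.extend(stream[i:i + n] for i in range(len(stream) - n + 1))
    return grams


words = normalize("In 1920, Élan rose")
# words is ['in', '19##', 'élan', 'rose']
```

Because the window slides over the blank-separated stream, n-grams such as
"n 19##" span a word boundary, while nothing is generated across the end of a
sentence.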



2.2 Query Processing 

HAIRCUT performs rudimentary preprocessing on queries to remove stop structure, 
e.g., affixes such as “... would be relevant” or “relevant documents should....” A list 
of about 1000 such English phrases was translated into French, German, and Italian 
using both Systran and the FreeTranslation.com translator. Other than this
preprocessing, queries are parsed in the same fashion as documents in the collection. 

In all of our experiments we used a simple two-state hidden Markov model that
captures both document and collection statistics. This model is alternatively
described as a linguistically motivated probabilistic model and has been
compared to the vector cosine and probabilistic models. After the query is
parsed, each term is weighted by the query term frequency and an initial
retrieval is performed, followed by a single round of relevance feedback.
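
For readers unfamiliar with this family of models, the sketch below shows one
common formulation of a two-state model: each query term is generated either
from the document or from the collection as a whole, and the two relative
frequencies are mixed with a fixed weight. The mixing weight alpha=0.3, the
use of log probabilities, and the function name hmm_log_score are assumptions;
the text does not give HAIRCUT's actual parameter values.

```python
import math
from collections import Counter


def hmm_log_score(query_terms, doc_terms, collection_counts,
                  collection_size, alpha=0.3):
    """Log-likelihood of a query under a document/collection mixture.

    alpha is the (assumed) probability of the "document" state; 1 - alpha
    is the probability of the "collection" state.
    """
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    score = 0.0
    for term, qtf in Counter(query_terms).items():
        p_doc = doc_counts[term] / doc_length if doc_length else 0.0
        p_coll = collection_counts.get(term, 0) / collection_size
        p = alpha * p_doc + (1.0 - alpha) * p_coll
        if p > 0.0:
            # Each term contributes in proportion to its query frequency.
            score += qtf * math.log(p)
        # Terms unseen in the whole collection are skipped in this sketch.
    return score


doc1 = ["drug", "policy", "holland"]
doc2 = ["railway", "privatization", "germany"]
collection = Counter(doc1 + doc2)
s1 = hmm_log_score(["drug", "policy"], doc1, collection, 6)
s2 = hmm_log_score(["drug", "policy"], doc2, collection, 6)
```

The collection component keeps a document from receiving zero probability
merely because it lacks one query term, so s2 is lower than s1 but still
finite.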

To perform relevance feedback we first retrieve the top 1000 documents. We use
the top 20 documents for positive feedback and the bottom 75 documents for
negative feedback; however, we check that no duplicate or near-duplicate
documents are included in these sets. We then select terms for the expanded
query based on three factors: a term's initial query term frequency (if any),
the cube root of the (α=3, β=2, γ=2) Rocchio score, and a metric that
incorporates an idf component. The top-scoring terms are then used as the
revised query. After retrieval using this expanded and reweighted query, we
have found a slight improvement by penalizing document scores for documents
missing many query terms. We multiply document scores by a penalty factor:

$$ PF = 1.0 - \left( \frac{\text{\# of missing terms}}{\text{total number of terms in query}} \right)^{1.25} $$

We use only about one-fifth of the terms of the expanded query for this penalty 
function. 
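
Read as one minus the fraction of missing query terms raised to the power
1.25, the penalty factor can be computed as in the short sketch below (the
function name and argument names are ours):

```python
def penalty_factor(missing_terms, total_terms, exponent=1.25):
    """PF = 1.0 - (missing_terms / total_terms) ** exponent."""
    if total_terms <= 0:
        raise ValueError("the query must contain at least one term")
    return 1.0 - (missing_terms / total_terms) ** exponent


# A document containing all 12 penalty terms keeps its full score,
# a document missing half of them keeps a little under 58 percent.
full = penalty_factor(0, 12)
half = penalty_factor(6, 12)
```

The exponent above 1 makes the penalty mild for documents missing only a few
terms and progressively harsher as more terms are absent.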





           # Top terms   # Penalty terms
words           60             12
6-grams        400             75



We conducted our work on a 4-node Sun Microsystems Ultra Enterprise 450 
server. The workstation had 2 GB of physical memory and access to 50 GB of 
dedicated hard disk space. 

The HAIRCUT system comprises approximately 25,000 lines of Java code. 



3 Monolingual Experiments 

Our approach to monolingual retrieval was to focus on language independent 
methods. We refrained from using language specific resources such as stoplists, lists 
of phrases, morphological stemmers, dictionaries, thesauri, decompounders, or 
semantic lexicons (e.g. EuroWordNet). We emphasize that this decision was made,
not from a belief that these resources are ineffective, but because they are not 
universally available (or affordable) and not available in a standard format. Our 
processing for each language was identical in every regard and was based on a 
combination of evidence from word-based and 6-gram based runs. We elected to use 
all of the topic sections for our queries. 
Fig. 1. Recall-precision curves for the monolingual task. The English curve is unofficial and is
produced from the bilingual relevance judgments. 

The retrieval effectiveness of our monolingual runs is fairly similar for each of the 
four languages as evidenced by Figure 1. We expected to do somewhat worse on the 
Italian topics since the use of diacritical marks differed between the topic statements 
and the document collection; consistent with our ‘language-independent’ approach we 
did not correct for this. Given the generally high level of performance, both in 
average precision and recall, and in the number of ‘best’ and ‘above median’ topics 
for the monolingual tasks (see Table 2), we believe that we have demonstrated that 
language independent techniques can be quite effective. 

Table 2. Results for monolingual task

           avg prec   recall      # topics   # best   # > median
aplmofr    0.4655     523 / 528   34          9       21
aplmoge    0.4501     816 / 821   37         10       32
aplmoit    0.4187     329 / 338   34          6       20
aplmoen    0.4193     563 / 579   33         (unofficial English run)



One of our objectives was to compare the performance of the constituent word and
n-gram runs that were combined for our official submissions. Figure 2 shows the
precision-recall curves for the base and combined runs for each of the four languages.
Our experience in the TREC-8 CLIR track led us to believe that n-grams and
words are comparable; however, each seems to perform slightly better in different
languages. In particular, n-grams performed appreciably better on translated German
queries, something we attribute to a lack of decompounding in our word-based runs.
This trend continued this year, with 6-grams performing just slightly better in
Italian and French, somewhat better in German, but dramatically worse in our
unofficial runs of English queries against the bilingual relevance judgments. We are
stymied by the disparity between n-grams and words in English and have never seen
such a dramatic difference in other test collections. Nonetheless, the general trend
seems to indicate that combination of these two schemes has a positive effect as
measured by average precision.

Fig. 2. Comparison of retrieval performance using unstemmed words, 6-grams, and a
combination of the two approaches for each of the four languages.

Our method of combining two runs is to normalize scores for each topic in a run 
and then to merge multiple runs by the normalized scores. 
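
The combination step can be sketched as below. Neither the normalization nor
the merging rule is specified in the text, so both are assumptions here:
scores are min-max normalized to the range 0..1 within each run for a topic,
and a document's normalized scores are summed across runs. The function names
are hypothetical.

```python
def normalize_scores(run):
    """Min-max normalize one run's scores for a single topic to [0, 1]."""
    if not run:
        return {}
    low, high = min(run.values()), max(run.values())
    span = high - low
    # If all scores are equal, give every document the same weight.
    return {doc: (score - low) / span if span else 1.0
            for doc, score in run.items()}


def combine_runs(runs):
    """Merge runs for one topic by summed normalized score (an assumption)."""
    combined = {}
    for run in runs:
        for doc, score in normalize_scores(run).items():
            combined[doc] = combined.get(doc, 0.0) + score
    return sorted(combined, key=combined.get, reverse=True)


word_run = {"a": 10.0, "b": 5.0, "c": 0.0}
gram_run = {"b": 200.0, "c": 100.0, "d": 0.0}
merged = combine_runs([word_run, gram_run])
# merged is ['b', 'a', 'c', 'd']
```

Normalizing per topic matters because word-based and 6-gram-based runs
produce scores on very different scales; without it one run would dominate the
merged ranking.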
4 Bilingual Experiments

Our goal for the bilingual task was to evaluate two methods for translating queries, 
commercial machine translation software and a method based on aligned parallel 
corpora. While high quality MT products are available only for certain languages, the 
languages used most commonly in Western Europe are well represented. We used the 
Systran product which supports bi-directional conversion between English and the 
French, German, Italian, Spanish, and Portuguese languages. We did not use any of 
the domain specific dictionaries that are provided with the product because we 
focused on automatic methods, and it seemed too difficult to determine which 
dictionary(ies) should be used for a particular query absent human guidance. 

The run, aplbifrc, was created by converting the French topic statements to English 
using Systran and searching the LA Times collection. As with the monolingual task 
both 6-grams and words were used separately and the independent results were 
combined. Our other official run using Systran, aplbispa, was based on the
Spanish topic statements.

We only had access to large aligned parallel texts in English and French. We were
therefore unable to conduct experiments in corpora-based translation in other
languages. Our English / French dataset included text from the Hansard Set-A,
Hansard Set-C, United Nations, RALI, and JOC corpora. The Hansard
data accounts for the vast majority of the collection.

Table 3. Description of the parallel collection used for aplbifrb

                  Description
Hansard Set-A     2.9 million aligned sentences
Hansard Set-C     aligned documents, converted to ~400,000 aligned sentences
United Nations    25,000 aligned documents
RALI              18,000 aligned web documents
JOC               10,000 aligned sentences



The process that we used for translating an individual topic is shown in Figure 3. 
First we perform a pre-translation expansion on a topic by running that topic in its 
source language on a contemporaneous expansion collection and extracting terms 
from top ranked documents. Thus for our French to English run we use the Le Monde 
collection to expand the original topic which is then represented as a weighted list of 
sixty words. Since the Le Monde collection is contemporaneous with the target LA 
Times collection it is a terrific resource for pre-translation query expansion. Each of 
these words is then translated to the target language (English) using the statistics of 
the aligned parallel collection. We selected a single ‘best’ translation for each word 
and the translated word retained the weight assigned during topic expansion. Our 
method of producing translations is based on a term similarity measure similar to 
mutual information; we do not use any dimension reduction techniques such as 
CL-LSI. The quality of our translation methodology is demonstrated for Topic 
C003 in Table 4. Finally we processed the translated query on the target collection in 
four ways, using both 6-grams and words, with and without relevance 
feedback. 
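
The single-word translation step described above can be sketched as follows. This is only an illustration under stated assumptions: the exact similarity measure is not given in the text, so the sketch substitutes a simple mutual-information-style score over sentence-level co-occurrence, and the function name `best_translations` and the toy aligned sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def best_translations(weighted_terms, aligned_pairs):
    """Pick one 'best' target word per source term; keep the source weight.

    weighted_terms: list of (source_term, weight) from topic expansion.
    aligned_pairs:  list of (source_sentence, target_sentence) strings.
    The score is an assumed stand-in for the paper's similarity measure.
    """
    n = len(aligned_pairs)
    src_df, tgt_df = Counter(), Counter()
    joint = defaultdict(Counter)
    for src_sent, tgt_sent in aligned_pairs:
        src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
        src_df.update(src_words)
        tgt_df.update(tgt_words)
        for s in src_words:
            joint[s].update(tgt_words)

    translated = []
    for term, weight in weighted_terms:
        best, best_score = None, float("-inf")
        for tgt, c in joint.get(term, {}).items():
            # Joint probability times the log association ratio.
            score = (c / n) * math.log((c * n) / (src_df[term] * tgt_df[tgt]))
            if score > best_score:
                best, best_score = tgt, score
        if best is not None:  # terms never seen in the corpus are dropped
            translated.append((best, weight))
    return translated


pairs = [
    ("la drogue", "the drug"),
    ("la vente", "the sale"),
    ("la drogue douce", "the soft drug"),
    ("vente libre", "free sale"),
]
query = best_translations([("drogue", 0.0776), ("vente", 0.0373)], pairs)
```

Because each translated word simply inherits the weight of its source term, several source terms may map to the same target word, as happens for "drug" in Table 4.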



[Figure 3: flow diagram. Original Query -> Pre-translation expansion (Le Monde Collection) -> Translation of single words (Hybrid Parallel Collection) -> Run translated query (LA Times).] 

Fig. 3. Translation approach for aplbifrb, our official French/English bilingual run using 
aligned parallel corpora. 



Official French Query 

<F-title> La drogue en Hollande 

<F-desc> Quelle est la politique des Pays-Bas en matière de drogue? 

<F-narr> Les documents pertinents exposent la réglementation et les décisions du gouvernement 
néerlandais concernant la vente et la consommation de drogues douces et dures. 

Official English Query 

<E-title> Drugs in Holland 

<E-desc> What is the drugs policy in the Netherlands? 

<E-narr> Relevant documents report regulations and decisions made by the Dutch government 
regarding the sale and consumption of hard and soft drugs. 

Systran translation of French query 

<F-title> Drug in Holland 

<F-desc> Which is the policy of the Netherlands as regards drug? 

<F-narr> The relevant documents expose the regulation and the decisions of Dutch government 
concerning the sale and the consumption of soft and hard drugs. 



Fig. 4. Topic C003 in the official French and English versions and as translated by Systran 
from French to English. 



We obtained superior results using translation software instead of our corpora-based 
translation. The precision-recall graph in Figure 5 shows a clear separation 
between the Systran-only run (aplbifrc) with average precision 0.3358 and the 
corpora-only run (aplbifrb) with average precision of 0.2223. We do not interpret this 
difference as a condemnation of our approach to corpus-based translation. Instead we 
agree with Braschler et al. that “MT cannot be the only solution to CLIR.” Both 
translation systems and corpus-based methods have their weaknesses. A translation 
system is particularly susceptible to named entities not being found in its dictionary. 
Perhaps as few as 3 out of the 40 topics in the test set mention obscure names: topics 
2, 8, and 12. Topics 2 and 8 have no relevant English documents, so it is difficult to 
assess whether the corpora-based approach would outperform the use of dictionaries 
or translation tools on these topics. The run aplbifra is simply a combination of 
aplbifrb and aplbifrc that we had expected to outperform the individual runs. 



Table 4. Topic C003. French terms produced during pre-translation expansion and single word 
translation equivalents derived from parallel texts. 

Weight   French           English             Weight   French          English 
0.0776   drogue           drug                0.0085   prison          prison 
0.0683   drogues          drugs               0.0084   suppression     removal 
0.0618   douces           freshwater          0.0083   problème        problem 
0.0595   dures            harsh               0.0083   produits        products 
0.0510   consommation     consumer            0.0082   pénalisation    penalty 
0.0437   matière          policy              0.0080   santé           health 
0.0406   bas              low                 0.0078   actuellement    now 
0.0373   vente            sales               0.0078   consommateurs   consumers 
0.0358   hollande         holland             0.0078   sévir           against 
0.0333   néerlandais      netherlands         0.0077   réflexion       reflection 
0.0174   cannabis         cannabis            0.0077   rapport         report 
0.0161   stupéfiants      narcotic            0.0077   professeur      professor 
0.0158   dépénalisation   decriminalization   0.0077   personnes       people 
0.0150   usage            use                 0.0077   souterraine     underground 
0.0141   trafic           traffic             0.0077   partisans       supporters 
0.0133   lutte            inflation           0.0076   sida            aids 
0.0133   toxicomanie      drug                0.0076   débat           debate 
0.0124   legalisation     legalization        0.0076   francis         francis 
0.0123   héroïne          heroin              0.0075   europe          europe 
0.0119   toxicomanes      drug                0.0075   membres         members 
0.0117   usagers          users               0.0092   peines          penalties 
0.0105   drogués          drug                0.0092   cocaïne         cocaine 
0.0104   répression       repression          0.0091   alcool          alcohol 
0.0103   prévention       prevention          0.0089   seringues       syringes 
0.0098   loi              act                 0.0089   risques         risks 
0.0098   substances       substances          0.0088   substitution    substitution 
0.0098   trafiquants      traffickers         0.0087   distinction     distinction 
0.0098   haschich         hashish             0.0087   méthadone       methadone 
0.0095   marijuana        marijuana           0.0087   dealers         dealers 
0.0094   problèmes        problems            0.0086   soins           care 

There are several reasons why our translation scheme might be prone to error. 
First of all, the collection is largely based on the Hansard data, which are transcripts 
of Canadian parliamentary proceedings. The fact that the domain of discourse in the 
parallel collection is narrow compared to the queries could account for some 
difficulties. And the English recorded in the Hansard data is formal, spoken, and uses 
Canadian spellings whereas the English document collection in the tasks is informal, 
written, and published in the United States. It should also be noted that generating 
6-grams from a list of words rather than from prose leaves out any n-grams that span 
word boundaries; such n-grams might capture phrasal information and be of particular 
value. Finally we had no opportunity to test our approach prior to submitting our 
results; we are confident that this technique can be improved. 
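
The remark about n-grams that span word boundaries can be made concrete with a small sketch. How the original system tokenized or padded its 6-grams is not stated, so the helper names and the plain sliding-window tokenization below are assumptions made for illustration only.

```python
def char_ngrams(text, n=6):
    """All distinct character n-grams in a string (sliding window)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngrams_from_prose(text, n=6):
    """N-grams over running text, so windows may cross word boundaries."""
    return char_ngrams(" ".join(text.lower().split()), n)


def ngrams_from_word_list(words, n=6):
    """N-grams taken from each word separately, as with a translated term list."""
    grams = set()
    for word in words:
        grams |= char_ngrams(word.lower(), n)
    return grams


prose = ngrams_from_prose("hard drugs policy")
listed = ngrams_from_word_list(["hard", "drugs", "policy"])
lost = prose - listed  # n-grams available only when the phrase is kept intact
```

In this toy case the word list yields only the 6-gram "policy", while the prose also yields phrase-spanning grams such as "s poli", which is the kind of information lost after word-by-word translation.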

[Figure 5: precision-recall graph, "French to English Bilingual Performance". Curves with average precision: fra (0.3212), frb (0.2223), frc (0.3358), frb2 (0.2595).] 

Fig. 5. Comparison of aplbifra (combination), aplbifrb (parallel corpus), and aplbifrc (Systran). 

With some post-hoc analysis we found one way to improve the quality of our 
corpus-based runs. We had run the translated queries both with, and without the use 
of relevance feedback. It appears that the relevance feedback runs perform worse 
than those without this normally beneficial technique. The dashed curve in Figure 5 
labeled ‘frb2’ is the curve produced when relevance feedback is not used with the 
corpora-translated query. When not utilizing post-translation relevance feedback we 
observed an improvement in average precision from 0.2694 to 0.3145. Perhaps the 
use of both pre-translation and post-translation expansions introduces too much 
ambiguity about the query. 

Below are our results for the bilingual task. There were no relevant English 
documents for topics 2, 6, 8, 23, 25, 27, and 35, leaving just 33 topics in the task. 

Table 5. Results for bilingual task 

Run        avg prec   % mono   recall (579)   # best   # > median   method 
aplbifra   0.3212     80.57%   527            6        27           Combine aplbifrb/aplbifrc 
aplbifrb   0.2223     55.75%   479            4        23           Corpora FR to EN 
aplbifrc   0.3358     84.23%   521            7        23           Systran FR to EN 
aplbispa   0.2595     73.28%   525            5        27           Systran SP to EN 
aplbige    0.4034     83.49%   529            unofficial run        Systran GE to EN 
aplbiit    0.3739     77.38%   545            unofficial run        Systran IT to EN 

5 Multilingual Experiments 

We did not focus our efforts on the multilingual task. We selected English as the 
topic language for the task and used Systran to produce translations in French, 
German, and Italian. We performed retrieval using 6-grams and words and then 
performed a multi-way merge using two different approaches, merging normalized 
scores and merging runs by rank. 
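
The two merging strategies can be sketched as below. The text does not say how scores were normalized or how ties were broken, so min-max normalization and run-order tie-breaking are assumptions, and the function names are hypothetical.

```python
def merge_by_score(runs):
    """Merge runs after min-max normalizing the scores within each run.

    runs: list of runs, each a list of (doc_id, score) pairs.
    """
    merged = []
    for run in runs:
        if not run:
            continue
        scores = [score for _, score in run]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0  # avoid division by zero for flat runs
        merged += [(doc, (score - low) / span) for doc, score in run]
    # sorted() is stable, so ties keep the order of the input runs.
    return [doc for doc, _ in sorted(merged, key=lambda pair: -pair[1])]


def merge_by_rank(runs):
    """Merge runs by rank position alone, ignoring score magnitudes."""
    ranked = []
    for run_index, run in enumerate(runs):
        ordered = sorted(run, key=lambda pair: -pair[1])
        ranked += [(rank, run_index, doc)
                   for rank, (doc, _) in enumerate(ordered, start=1)]
    return [doc for _, _, doc in sorted(ranked)]


french = [("fr1", 12.0), ("fr2", 6.0), ("fr3", 3.0)]
german = [("de1", 0.9), ("de2", 0.8)]
by_score = merge_by_score([french, german])
by_rank = merge_by_rank([french, german])
```

Rank merging treats every collection as equally likely to hold relevant documents at each rank, which is one plausible reason it behaves differently from score merging when many topics have no relevant documents in some languages.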

The large number of topics with no relevant documents in the collections of 
various languages suggests that the workshop organizers were successful in selecting 
challenging queries for merging. It seems clear that more sophisticated methods of 
multilingual merging are required to avoid a large drop in precision from the 
monolingual and bilingual tasks. 

Table 6. Results for official multilingual submissions 

Run      avg prec   recall      # best   # > median   method 
aplmua   0.2391     1698/2266   1        30           rank 
aplmub   0.1924     1353/2266   3        23           score 



6 Conclusions 

The CLEF-2000 workshop has provided an excellent opportunity to explore the 
practical issues involved in cross-language information retrieval. We approached the 
monolingual task believing that it is possible to achieve good retrieval performance 
using language-independent methods. This belief appears to have been borne 
out by the results we obtained using a combination of words and n-grams. For 
the bilingual task we kept our philosophy of simple methods, but also used a high- 
powered machine translation product. While our initial experiments using parallel 
corpora for translation were not as effective as those with machine translated queries, 
the results were still quite credible and we are confident this technique can be 
improved further. 

Abstract. The experiment setup that was used for Eurospider's CLEF participation 
is described and a preliminary analysis of the results that were obtained 
is given. Three runs each were submitted for the multilingual and monolingual 
tasks. The goal of these experiments was to investigate query translation using 
different methods, as well as document translation. A main focus was the use of 
so-called similarity thesauri for query translation. This approach produced 
promising results, and shows potential for future adaptations. 



1 Introduction 

This paper describes our experiments conducted for CLEF 2000. We will begin by 
outlining our system setup, including details of the collection and indexing. This is 
followed by a description of the particular characteristics of the individual experi- 
ments, and a preliminary analysis of our results. The paper closes with a discussion of 
our findings. 

Eurospider participated in the multilingual and monolingual retrieval tasks. For 
multilingual retrieval, we investigated both document and query translation, as well as 
a combination of the two approaches. For translation, we used similarity thesauri, a 
bilingual wordlist and a machine translation system. Various combinations of these 
resources were tested and are discussed in the following. 



2 Multilingual Retrieval 

The goal of the multilingual task in CLEF is to pick a topic language, and use the que- 
ries to retrieve documents regardless of their language. I.e., a mixed result list has to 
be returned, potentially containing documents in all languages. The CLEF test collec- 
tion consists of newspapers for German (Frankfurter Rundschau, Der Spiegel), French 
(Le Monde), Italian (La Stampa) and English (LA Times). 

We submitted three runs for this task, labeled EITCLEFM1, EITCLEFM2, and 
EITCLEFM3. They represent increasingly complex experiments. All runs use the 
German topics and all topic fields. We spent our main effort on producing these 
multilingual experiments. In contrast, the monolingual runs were base runs for the 
multilingual work, and were sent in mainly to have a comparison base. 

We investigated both query translation (also abbreviated "QT" in the following) 
and document translation ("DT"). Technologies used for query translation were simi- 
larity thesauri ("ST"), a bilingual wordlist ("WL") and a commercially available ma- 
chine translation ("MT") system. For document translation, the same MT system was 
used. 

Following is a description of these key technologies. 

Similarity Thesaurus: The similarity thesaurus is an automatically calculated data 
structure, which is built on suitable training data. It links terms to lists of their statisti- 
cally most similar counterparts [3]. If multilingual training data is used, the resulting 
thesaurus is also multilingual. Terms in the source language are then linked to the 
most similar terms in the target language [4]. Such a thesaurus can be used to produce 
a "pseudo-translation" of the query by substituting the source language terms with 
those terms from the thesaurus that are most similar to the query as a whole. 

We used training data provided by the Schweizerische Depeschenagentur (SDA, 
the Swiss national news wire) to build German/French and German/Italian similarity 
thesauri. A subset of this data was used earlier as part of the TREC6-8 CLIR test 
collection. All in all, we used a total of 11 years of news reports. While SDA produces 
German, French and Italian news reports, it is important to note that these stories are 
not actual translations. They are written by different editorial staff in different places, 
to serve the interests of the different audiences. Therefore, the SDA training collection 
is a comparable corpus (as compared to a parallel corpus, which contains actual 
translations of all items). The ability of the similarity thesaurus calculation process to 
deal with comparable corpora is a major advantage, since these are usually easier to 
obtain than the rare parallel corpora. 

Unfortunately, we were not able to obtain suitable German/English training data in 
time to also build a German/English thesaurus. Instead, we opted to use a bilingual 
German/English wordlist. As will be shown below, this was likely a disadvantage. 

Bilingual wordlist: Because of the lack of English training data, we used a 
German/English bilingual wordlist for German/English crosslingual retrieval. We 
assembled this list from various free sources on the Internet. This means that the wordlist is 
simplistic in nature (only translation pairs, no additional information such as 
grammatical properties or word senses) and noisy (i.e. there is a substantial number of 
incorrect entries). 

Machine translation system: For a limited number of language pairs, commercial end- 
user machine translation products are available nowadays. Since some of these sys- 
tems are inexpensive and run on standard PC hardware, we decided to try and link 
such a product with both our translation component and our retrieval software. We 
therefore used MT to translate the document collection, enabling us to use the 
translated documents in our retrieval system, and also to translate the queries, combining 
those with the translation output from the similarity thesaurus. 

Indexing: We used the standard RotondoSpider retrieval system developed at Euro- 
spider for indexing and retrieval. Additional components were used for query transla- 
tion and blind feedback. 

Indexing of German documents and queries used the Spider German stemmer, 
which is based on a dictionary coupled with a rule set for decompounding of German 
nouns. 

Indexing of French documents and queries used the Spider French rule-based 
stemmer. French accents were retained, since we decided that the quality of the data 
from Le Monde ensured consistent use of accenting. 

Indexing of Italian documents and queries used the Spider Italian rule-based 
stemmer. There was a simple preprocessing step that replaced the combination "vowel + quote" 
with an accented vowel, since the La Stampa texts use this alternative 
representation for accented characters. This simple rule produces some errors if a word was 
intentionally quoted, but the error rate was considered too small to justify the 
development of a more sophisticated replacement process. 
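
A rule of this kind might look like the following sketch. It is not the Spider implementation: the function name is hypothetical, always choosing the grave accent is a simplifying assumption, and the check that no letter follows the quote is an added guard that the text does not mention.

```python
import re

# Assumption: every "vowel + quote" maps to the grave-accented vowel.
_GRAVE = str.maketrans("aeiouAEIOU", "àèìòùÀÈÌÒÙ")
# A vowel followed by an apostrophe that is not followed by a word character.
_VOWEL_QUOTE = re.compile(r"([aeiouAEIOU])'(?!\w)")


def restore_accents(text):
    """Replace 'vowel + quote' with the corresponding accented vowel."""
    return _VOWEL_QUOTE.sub(lambda m: m.group(1).translate(_GRAVE), text)


fixed = restore_accents("la citta' e' piu' grande")
quoted = restore_accents("'casa'")  # intentionally quoted word
```

The second call shows the error case the text describes: the closing quote of an intentionally quoted word is absorbed into an accent.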

Indexing of English documents used an adapted version of the Porter rule-based 
stemmer. 

The Spider system was configured to use a straight Lnu.ltn weighting scheme for 
retrieval, as described in [5]. 

The ranked lists for the three multilingual runs were obtained as follows: 

EITCLEFM1: We built one large unified index containing all the German documents 
plus all the English, French and Italian documents in their German translations as 
obtained by MT. It is then possible to perform straight monolingual German retrieval on 
this combined collection. An added benefit is the avoidance of the merging problem 
that typically arises when results are calculated one language at a time. Since only one 
search has to be performed on one index, only one ranked list is obtained. 

EITCLEFM2: Our second submission has a different focus. Instead of document 
translation, we used only query translation for this experiment. We obtained individual 
runs for each language pair (German/German, German/French, German/Italian, and 
German/English). For each pair, we used two different translation strategies (or in the 
case of German/German, two different retrieval strategies). For retrieval of the French 
and Italian documents, we translated the German queries both using an appropriate 
similarity thesaurus and using the MT system. For search on the English collection, we 
again used the MT system, but additionally used the German/English bilingual word- 
list. The two German monolingual runs were a simple, straightforward retrieval run, 
and a run that was enhanced through blind relevance feedback (for a discussion of 
blind feedback and some possible enhancements to it, see e.g. [2]). The choice of 
relevance feedback was to "imitate" the expansion effect of the similarity thesaurus for the 
other languages. We expanded the query by the twenty statistically best terms from the 
top 10 initially retrieved documents. 
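
The expansion step can be illustrated with a short sketch. The criterion for the "statistically best" terms is not specified in the text, so the tf-idf style term score used here is an assumption, as are the function name and the toy document frequencies.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, collection_df, n_docs,
                 top_docs=10, n_terms=20):
    """Blind feedback: add the best-scoring terms from the top documents.

    ranked_docs:   list of token lists, best document first.
    collection_df: document frequency of each term in the whole collection.
    The term score (tf * idf summed over the top documents) is an assumption.
    """
    scores = Counter()
    for doc in ranked_docs[:top_docs]:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / collection_df.get(term, 1))
    new_terms = [t for t, _ in scores.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:n_terms]


docs = [
    ["drug", "policy", "netherlands", "the"],
    ["drug", "cannabis", "the"],
    ["cannabis", "coffee", "the"],
]
df = {"drug": 10, "policy": 50, "netherlands": 5,
      "the": 1000, "cannabis": 4, "coffee": 20}
expanded = expand_query(["drug"], docs, df, n_docs=1000, n_terms=2)
```

With the defaults (`top_docs=10`, `n_terms=20`) the function matches the parameters quoted in the text; the small values in the usage lines only keep the example readable.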

The two runs for each language are merged by adding up the ranks of a document 
in both individual runs to form a new score. In order to boost documents with high 
ranks, we used the logarithms of the ranks of the documents in both experiments. 

new_score = MAX - (log(rank_run_1) + log(rank_run_2)); 

This step resulted in four runs, one per language combination. These were then 
merged by taking a document each in turn from each run, thus producing the final 
ranked list (this process is sometimes also referred to as "interleaving"). 
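
The two merging steps above can be sketched as follows. The value of MAX and the treatment of documents retrieved by only one of the two runs are not given in the text; here MAX is an arbitrary constant and a missing document is assigned the rank just past the end of that run, both of which are assumptions.

```python
import math
from itertools import zip_longest


def merge_log_ranks(run_1, run_2, max_score=100.0):
    """Combine two ranked lists: new_score = MAX - (log r1 + log r2).

    run_1, run_2: lists of document ids, best first.
    """
    rank_1 = {doc: r for r, doc in enumerate(run_1, start=1)}
    rank_2 = {doc: r for r, doc in enumerate(run_2, start=1)}
    scored = []
    for doc in dict.fromkeys(run_1 + run_2):  # union, first-seen order
        r1 = rank_1.get(doc, len(run_1) + 1)  # assumed rank if missing
        r2 = rank_2.get(doc, len(run_2) + 1)
        scored.append((max_score - (math.log(r1) + math.log(r2)), doc))
    return [doc for _, doc in sorted(scored, key=lambda pair: -pair[0])]


def interleave(runs):
    """Take one document from each run in turn until all runs are exhausted."""
    merged = []
    for tier in zip_longest(*runs):
        merged += [doc for doc in tier if doc is not None]
    return merged


combined = merge_log_ranks(["a", "b", "c"], ["a", "c", "d"])
final = interleave([["g1", "g2"], ["f1"], ["i1", "i2", "i3"]])
```

Using logarithms compresses differences between low ranks, so a document ranked highly by either run keeps most of that advantage in the combined list.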

EITCLEFM3: The last multilingual experiment combines elements from both the QT 
and DT-based runs. To produce the final ranked list, these two runs are merged by 
setting the score to the sum of the logarithms of the ranks, as described above. 




Fig. 1. Procedure to obtain the multilingual experiments 



3 Monolingual Retrieval 

We also submitted three runs for the monolingual task named EITCLEFGG, 
EITCLEFFF and EITCLEFII (German, French and Italian monolingual, respectively). 
These runs all use the full topics (all fields). As mentioned earlier, they were produced 
mainly to serve as baselines for comparison. The main effort was invested into the 
multilingual experiments. 

EITCLEFGG: This was our German monolingual submission. It is the straight 
retrieval run that was used to produce the EITCLEFM2 run (see above). 



EITCLEFFF and EITCLEFII: These two runs were also obtained through straight 
monolingual retrieval using the French and Italian queries, respectively. 



4 Results 

Looking at the results, the document translation-based run outperforms the query 
translation-based run. However, looking at the individual parts that make up the QT- 
based run, we notice that the translation using the bilingual wordlist performs poorly. 
It seems likely that the actual difference would be significantly smaller if a good Eng- 
lish similarity thesaurus was available. 



Table 1. Average precision numbers for the multilingual experiments 

Runs against Multilingual Collection   Average Precision 
EITCLEFM1                              0.2816 
EITCLEFM2 
EITCLEFM3 

The combined run produces the best results, and does so on a consistent basis. As 
shown in Table 2, the majority of queries improve, often substantially, in terms of 
average precision when compared to either the DT-only or QT-only run. The picture 
is less conclusive for the comparison between DT-only and QT-only. We think that 
this shows that whereas both approaches have strengths, they mix well in the 
combined run to boost performance. 



Table 2. Comparison of average precision numbers for individual queries 

Comparison of Avg. Prec. per Query      better;     better;     worse;      worse; 
                                        diff >10%   diff <10%   diff <10%   diff >10% 
EITCLEFM3 (comb.) vs. EITCLEFM1 (DT)    16          16          6           2 
EITCLEFM3 (comb.) vs. EITCLEFM2 (QT)    19          12          4           5 
EITCLEFM1 (DT) vs. EITCLEFM2 (QT)       14          10          5           11 

We also studied individual language pairs and the impact of the different query 
translation strategies. 



Table 3. Average precision numbers for the German monolingual runs 

Runs against German Collection   Average Precision 
Straight                         0.4030 
Blind Feedback                   0.3994 



It seems like the blind feedback loop did not help boost performance. In any case, 
the difference is so slight that it can be considered meaningless. A per-query analysis 
shows that most queries are affected little by the feedback, and that the number of 
queries with a substantial increase or decrease in average precision is exactly the 
same. This reinforces the conclusion that the feedback was not helpful in this case. 



Table 4. Average precision numbers for runs against the French collection 

Runs against French Collection   Average Precision 
Monolingual                      0.3884 
MT G/F                           0.3321 
Similarity Thesaurus G/F         0.2262 
Combined G/F                     0.3494 



The French MT-based run outperforms the similarity thesaurus-based run 
substantially. However, a sizable part of the difference can be attributed to five queries that 
failed completely using the thesaurus (we consider a query a complete failure if the 
result has an average precision < 0.01). For the rest of the queries, the similarity 
thesaurus performed well, even outperforming the MT-based run by more than 10% for 
eight queries in terms of average precision. The combined run gives a modest 
improvement over the MT run: 20 queries benefit from the combination, whereas the 
performance of the remaining 14 queries falls. 
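
The per-query comparisons used in this section (better or worse by more or less than 10%, and complete failures below 0.01) can be tabulated with a sketch like the one below. Whether the 10% difference is relative or absolute is not stated, so a relative difference with respect to the second run is assumed; the function name and the sample values are hypothetical.

```python
def compare_runs(ap_a, ap_b, threshold=0.10, failure=0.01):
    """Bucket per-query average precision of run A against run B.

    ap_a, ap_b: dicts mapping query id -> average precision.
    Returns (bucket counts, list of queries on which run A failed completely).
    """
    buckets = {"better >10%": 0, "better <10%": 0,
               "worse <10%": 0, "worse >10%": 0, "equal": 0}
    for query, a in ap_a.items():
        b = ap_b[query]
        if a == b:
            buckets["equal"] += 1
            continue
        relative = abs(a - b) / max(b, 1e-9)  # assumed: relative to run B
        side = "better" if a > b else "worse"
        size = ">10%" if relative > threshold else "<10%"
        buckets[f"{side} {size}"] += 1
    failed = [q for q, a in ap_a.items() if a < failure]
    return buckets, failed


thesaurus = {1: 0.50, 2: 0.31, 3: 0.20, 4: 0.005}
mt = {1: 0.40, 2: 0.30, 3: 0.21, 4: 0.30}
counts, failures = compare_runs(thesaurus, mt)
```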



Table 5. Average precision numbers for runs against the Italian collection 

Runs against Italian Collection   Average Precision 
Monolingual                       0.4319 
MT G/I                            0.3306 
Similarity Thesaurus G/I          0.2568 
Combined G/I                      0.3636 



In Italian, the similarity thesaurus is closer to the performance of the MT-based run. 
Again, a big part of the difference is due to 7 queries failing completely when using 
the thesaurus. The combination is a reasonable improvement over the MT-only run, 
gaining 10% in average precision. 



Table 6. Average precision numbers for runs against the English collection 

Runs against English Collection   Average Precision 
Monolingual                       0.3879 
MT G/E                            0.3753 
Wordlist G/E                      0.1414 
Combined G/E                      0.2809 



The good performance of the MT-based German/English run is striking. This 
is probably because the main effort in MT research still goes into language 
combinations involving English. The poor performance of the run using the bilingual wordlist 
is also noteworthy. While this might be partly due to the shaky quality of the input 
sources, we think that it underscores how important word sense disambiguation is, 
something which MT and the similarity thesaurus try to address, but which is lacking 
from our wordlist. It seems obvious that bilingual wordlists/dictionaries are not 
competitive without a serious investment of effort in that direction. 

We are pleased to see that our runs compare favorably with other 
entries in CLEF. Table 7 shows an analysis of per-query performance compared to the 
median performance of all participants. The multilingual runs performed especially 
strongly, and the two runs EITCLEFM1 and EITCLEFM3 outperform all other 
officially reported results for CLEF 2000. The monolingual runs are more mixed, which 
was to be expected, since we did not tune them specifically for performance. The 
German run seems to perform nicely, placing among the best runs for this language. 
We believe this to be due to the compound analysis in the Spider stemming, since all 
competitive German experiments by other participants have addressed the 
decompounding problem in one way or another. The results for French and Italian indicate 
room for improvement. It is interesting to see that participants in the French and Italian 
monolingual tasks in general obtained similar performance. 



Table 7. Officially submitted runs compared to median of all submitted runs 
(on individual query basis) 

Run         Best   Above   Median   Below   Worst   # queries 
EITCLEFM1   1      29      0        10      0 
EITCLEFM2   1      22      2        15      0 
EITCLEFM3   7      23      1        9       0 
EITCLEFGG   6      17      6        8       0       37 
EITCLEFFF   0      7       5        22      0       34 
EITCLEFII   3      7       7        17      0       34 

5 Conclusions 

Overall, we think the performance of the similarity thesaurus is remarkable. While it 
did not produce results equal to the MT-based runs, it is important to note that we 
were in a "worst-case scenario": the thesauri were built on a comparable corpus (no 
real translations, as opposed to a parallel corpus), and there was no overlap between the 
training data and the test collection. This means that similar requirements for other 
translation scenarios can be quite easily matched. I.e., it would be easy to build similarity 
thesauri with comparable performance for a multitude of additional language pairs, 
even exotic ones, simply by gathering suitable training data, such as taking a sufficient 
amount of text from one national newspaper each. Also, the performance of the simi- 
larity thesaurus will get a sizeable boost when the problems can be addressed that led 
to a complete failure in translation of a number of queries. We should be able to do 
this by increasing the size of the thesaurus, which again is only a matter of processing 
more training data. Note also that the thesaurus is suited for situations in which the 
query length is much shorter, such as Web searches. As shown during the Eurosearch 
project (for a short description of Eurosearch, see [1]), the expansion effect of the 
thesaurus is beneficial for the short queries. Machine translation systems traditionally 
have problems with short, keyword style queries. 

Document translation gave us some good results, and was feasible for a collection 
of the size of the CLEF test collection. This means that DT should not be discounted 
for reasonably static collections with limited size. Note, however, that some of the 
advantage we found for DT versus query translation may be due to the inadequate 
performance of the wordlist we used for English. Also, QT clearly remains the only 
possibility for huge or highly dynamic collections. 

Abstract. The primary goal of our participation in CLEF is to acquire 
experience with supporting cross-lingual retrieval. We submitted runs for 
all four target languages, but our main interest has been in the bilingual 
Dutch to English runs. We investigated whether we can obtain a rea- 
sonable performance without expensive (but high quality) resources; we 
have used only ‘off-the-shelf’, freely available tools for stopping, stem- 
ming, compound-splitting (only for Dutch) and translation. Although our 
results are encouraging, we must conclude that a poor man’s approach 
should not expect to result in rich men’s retrieval results. 



1 Goals 

The Mirror DBMS [2] aims specifically at supporting both data management and 
content management in a single system. Its design separates the retrieval model 
from the specific techniques used for implementation, thus allowing more flexibil- 
ity to experiment with a variety of retrieval models. Its design based on database 
techniques intends to support this flexibility without causing a major penalty on 
the efficiency and scalability of the system. The support for information retrieval 
in our system is presented in detail elsewhere. 

The primary goal of our participation in CLEF is to acquire experience with 
supporting Dutch users. Also, we want to investigate whether we can obtain a 
reasonable performance without requiring expensive (but high quality) resources. 
We do not expect to obtain impressive results with our system, but hope to 
obtain a baseline from which we can develop our system further. We decided to 
submit runs for all four target languages, but our main interest is in the bilingual 
Dutch to English runs. 



2 Pre-processing 

We have used only ‘off-the-shelf’ tools for stopping, stemming, compound-split- 
ting (only for Dutch) and translation. All our tools are available for free, without 
usage restrictions for research purposes. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 149 - 17^71 2001. 

Table 1. Size of the stoplists used. 

Language #words 
Dutch    124 
English  95 
German   238 
French   218 
Italian  133 



Stopping and Stemming 

Moderately sized stoplists, of comparable coverage, were made available by the 
University of Twente (see also Table 1). 

We used the stemmers provided by Muscat, an open source search engine. 
The Muscat software includes stemmers for all five languages, as well as Spanish 
and Portuguese. The stemming algorithms are based on the Porter stemmer. 



Dictionaries 

The Ergane translation dictionaries were made available by Gerard van Wilgen. 
To avoid the necessity of a bilingual wordlist for every possible language combi- 
nation, Ergane uses the artificial language Esperanto as an interlingua. Ergane 
supports translation from and to no fewer than 57 languages, although some 
languages are only covered by a few hundred words. The number of entries in the 
dictionaries used is summarized in Table 2. 
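
The interlingua idea can be sketched as the composition of two word lists. The code below is only an illustration of the principle: the function name is hypothetical and the dictionary entries are toy examples written for this sketch, not data taken from Ergane.

```python
def pivot_translate(word, source_to_pivot, pivot_to_target):
    """Translate via an interlingua by chaining two dictionaries.

    Both dictionaries map a word to a list of translations.
    Returns target words in first-seen order, without duplicates.
    """
    targets = []
    for pivot_word in source_to_pivot.get(word, []):
        for target_word in pivot_to_target.get(pivot_word, []):
            if target_word not in targets:
                targets.append(target_word)
    return targets


# Toy entries (assumed for illustration): Dutch -> Esperanto -> English.
dutch_to_esperanto = {"wereld": ["mondo"], "bevolking": ["loĝantaro"]}
esperanto_to_english = {"mondo": ["world"], "loĝantaro": ["population"]}

world = pivot_translate("wereld", dutch_to_esperanto, esperanto_to_english)
missing = pivot_translate("conferentie", dutch_to_esperanto, esperanto_to_english)
```

With a pivot language, n languages need only n word lists instead of one list per language pair, at the cost that a word is translatable only if both halves of the chain cover it.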



Table 2. Number of entries in the Ergane dictionaries. 

Language #words 
Dutch    56,006 
English  15,812 
French   10,282 
German   14,410 
Italian  3,793 



Because of synonyms, the size of a bilingual dictionary might actually be 
bigger than the size of the smallest word-list of a language pair. After removal of 
multiword expressions, the number of Dutch entries in each of the bilingual translation 
lexicons is presented in Table 3. 

Note that these dictionary sizes are really small compared to dictionaries 
used in other cross-language retrieval experiments. For instance, Hiemstra and 
Kraaij have used professional dictionaries that are about 15 times as large. 

^ http://open.muscat.com/ 

^ http://www.travlang.com/Ergane/ 

Table 3. Sizes of the bilingual dictionaries (from Dutch to target language). 

Target  #words 
English 20,060 
French  15,158 
German  15,817 
Italian 6,922 



Compound-Splitting 

Compound-splitting was only used for the Dutch queries. We applied a simple 
compound-splitter developed at the University of Twente. The algorithm tries 
to split any word that is not in the bilingual dictionary using the full word-list of 
about 50,000 Dutch words from Ergane. The algorithm tries to split the word into 
as few parts as possible. It encodes a morphological rule to handle a property 
known as ‘tussen-s’, but it does not use part-of-speech information to search for 
linguistically plausible compounds. 

Because the Dutch word-list used for splitting was much larger than the 
number of entries in the bilingual dictionaries, compound-splitting might 
result in words that are only partially translated. For example, the Dutch word 
‘wereldbevolkingsconferentie’ (topic 13, English: ‘World Population Conference’) 
was correctly split into three parts: ‘wereld’, ‘bevolking’ and ‘conferentie’, of which 
only the first two words have entries in the Dutch-to-French dictionary. 

3 System 

For a detailed description of our retrieval system, we refer the interested user to 
0 . The underlying retrieval model is best explained in our technical report^ . 
It supplements the theoretical basis of the model with a series of experiments, 
comparing this model with other, more common retrieval models. 

4 Results 

This section discusses the results obtained with our system. We discuss the
retrieval results expressed in average precision, and the coverage of our
translations. After discussing the official runs, we present some tests performed with
pre-processing of Dutch topics.

4.1 Official Results 

All experiments were done using the title and description fields of the topics. 
The average query length for Dutch was 10.5 after stopping (which is of course 

^ This example also illustrates the ‘tussen-s’ rule: the ‘s’ between ‘bevolking’ and 
‘conferentie’ has been correctly removed. 

http://wwwhome.cs.utwente.nl/~hiemstra/papers/index.html#ctit

Table 4. Summary of results (after fixes).

               # queries  Average Prec.  R-prec.
English           33         0.4070      0.4163
French            33         0.4090      0.3831
German            36         0.3134      0.3149
Italian           36         0.3980      0.3935
Bi-lingual        32         0.2375      0.2392
Multi-lingual     39         0.1018      0.1448


Table 5. The submitted, flawed results.

               # queries  Average Prec.  R-prec.
German            37         0.1794      0.2032
Multi-lingual     39         0.0864      0.1330



rather long compared to the average query size people enter in e.g. web search
engines).

Table 6 summarizes our results. The second column shows the number of
queries with hits in the monolingual runs; the third and fourth columns show
the mean average precision^. The monolingual results for English have been
based on the bilingual qrels. The last column summarizes the drop in average
precision that can be attributed to the translation process.



Table 6. Official results (after fixes).

          # queries  Monolingual  Dutch -> X  relative
English      33        0.4070       0.2303      57%
French       34        0.4090       0.1486      36%
German       37        0.3134       0.1050      34%
Italian      34        0.3980       0.0989      24%



We hypothesize from the relatively low average precision (0.3134) on the
monolingual German task that we really have to perform compound-splitting of
this corpus. Another possible cause of the lower score for German is that we had
to merge the runs from the two subcollections, which were handled separately.
However, our experiments on TREC-8 showed that this cannot really explain such a
performance drop.

We attribute the large drop in performance for e.g. the bilingual Italian task
(only 24% of the average precision of the monolingual task) to the small coverage
of our translation dictionaries. The coverage of the topic translations produced
is summarized in Table 7.

® The mean average precision for the bilingual runs as given by trec_eval, normalized 
for the number of queries with hits in the monolingual case.

Together, the inferior results on German and Italian explain the disappointing
average precision obtained on the multilingual retrieval task (0.0864).



Table 7. Coverage of the translations (40 queries).

experiment         total terms  not translated  relative
Dutch -> English       420            92           22%
Dutch -> French        420           138           33%
Dutch -> German        420           115           27%
Dutch -> Italian       420           199           47%



4.2 Morphological Normalisation and Compound-Splitting 

Our primary goal with CLEF participation is to test whether we could pro- 
vide a Dutch interface to our retrieval systems. To confirm our intuition about 
stemming and compound-splitting, we performed some test runs to analyze the 
effects of morphological normalisation and compound-splitting for Dutch. We 
either performed stemming or not, and performed compound-splitting or not, 
resulting in four variants of the system: 



nlen1: base-line translation using full-form dictionary
nlen2: translation using Dutch stemmer and a dictionary with stemmed entries
nlen3: translation using compound-splitter for Dutch and full-form dictionary
nlen4: translation using compound-splitter and dictionary with stemmed entries

The results of these runs are summarized in Table 8. We conclude that
compound-splitting is very important, and stemming seems a useful
pre-processing step.



Table 8. Results on Dutch runs (33 queries).

run    average precision  improvement
nlen1       0.1726
nlen2       0.2228            29%
nlen3       0.1912            11%
nlen4       0.2303            33%
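
As a worked check, the improvement column of Table 8 can be recomputed from the average precision values as the relative gain over the base-line run. The helper name `relative_improvement` is an assumption made for this illustration. The run names are those of the four system variants listed above.

```python
def relative_improvement(baseline, score):
    """Relative gain of `score` over `baseline`, as a rounded percentage."""
    return round(100 * (score - baseline) / baseline)


# Average precision per run, copied from Table 8.
runs = {"nlen1": 0.1726, "nlen2": 0.2228, "nlen3": 0.1912, "nlen4": 0.2303}

improvements = {
    name: relative_improvement(runs["nlen1"], ap)
    for name, ap in runs.items()
    if name != "nlen1"
}
```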



To support these conclusions, Table 9 summarizes the coverage of the various
translations used in the Dutch runs. Compound-splitting and morphological
stemming of Dutch words nearly triples the relative coverage of the translation
dictionaries. The total of 92 untranslated Dutch terms in the English queries

Table 9. Coverage of the translations (40 queries).

experiment  total terms  not translated  relative
nlen1           366           201          57%
nlen2           366           130          36%
nlen3           420           160          38%
nlen4           420            92          22%



include about 13 proper names like ‘Weinberg’, ‘Salam’ and ‘Glashow’ (topic 2)
and a few terms that were left untranslated in the Dutch topics like ‘Académie
Française’ (topic 15) and ‘Deutsche Bundesbahn’ (topic 40).

5 Conclusions and Future Work 

Summarizing our experiments, we may conclude that our retrieval model works
well for all monolingual runs, except for German. Future experiments will have to
confirm whether a process like compound-splitting will indeed bring our
monolingual German results to a level comparable to the other languages. The influence of
compound-splitting of Dutch topics on the bilingual results raises our
expectations in this respect.

We were not at all unhappy with our bilingual results. But, from the coverage 
of the translations, we still have to conclude that a poor man’s approach should 
not expect to result in rich men’s retrieval results. However, we cannot blame it 
all on the dictionaries. The current version of our retrieval system does not use 
query expansion techniques to improve mediocre translations; it remains to be 
seen if better statistical techniques can bring us closer to the results obtained 
with ‘proper’ linguistic tools.

Abstract. This report describes the work done for our participation in the
multilingual track of the Cross-Language Evaluation Forum (CLEF). We use a 
dictionary-based approach to translate English queries into German, French and 
Italian queries. We then apply a term disambiguation technique to select the 
best translation terms from terms found in the dictionary entries, and a query 
expansion technique to enhance the queries’ retrieval performance. We show 
that the word-formation characteristics of different languages affect the 
effectiveness of statistical techniques in dealing with the ambiguity problem. 



1 Introduction 

A multilingual environment poses many interesting challenges to the field of cross- 
language information retrieval (CLIR). In CLIR, we deal with a query in one language 
and documents or information to retrieve in another language or, in the case of a 
multilingual environment, many languages. CLIR techniques, in general, always 
involve some kind of language translation process. CLIR researchers have proposed a
number of translation techniques, such as ones that use NLP-based machine
translation algorithms, parallel corpora, and machine readable dictionaries. Machine
translation systems have been shown to produce good results; however, such systems
are only available for a few languages. Parallel corpus-based techniques have also
been shown to give good CLIR results [9]. However, parallel corpora are expensive
to build, and those that are available are fairly limited in terms of their domain
coverage. As a consequence, such techniques often fail to translate terms from a wider
range of domains. Fortunately, more and more comparable corpora have recently been
built and made available to researchers. Comparable corpora can be considered
similar to parallel corpora since they consist of documents in many languages
concerning the same topics. Hopefully, this will stimulate more research activities in
the field.

Translation techniques that use bilingual dictionaries are very practical, as they do 
not require any deeper linguistic knowledge such as syntactic grammars and 
semantics. An ideal dictionary for this purpose would be one that is available in a 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 156-165, 2001.
machine readable dictionary (MRD) format, to allow automatic term translations.
Unfortunately, MRDs for many languages are still relatively expensive to acquire. For 
this reason, we use dictionary resources that are freely available on the Internet to 
translate English queries into German, French and Italian. The translation method is
straightforward: we simply replace each English term with the translation
terms found for it in each of the dictionaries. Clearly, the quality of the
translation very much depends on the quality and the comprehensiveness of the
dictionary. We consider the limited vocabulary of our dictionaries as an additional
challenge.

Our participation in this year’s Cross-Language Evaluation Forum (CLEF) has 
provided us with an opportunity to better understand the issues in Cross-Language 
Information Retrieval (CLIR) through experimentation. Our previous work has been 
on bilingual CLIR. The multilingual task is different from the bilingual task because
the collections contain documents in more than one language. In this task, we face the
challenge of indexing and merging retrieval results from a number of language
collections. The index can be built as a single index or as an individual index for each
collection. With a single index, there is no need to merge the retrieval results from each
collection. The second case requires merging the different retrieval results into a
single ranked list. In our work, we chose to use a separate index for each language, in the
hope of learning the characteristics and translation problems of each language. Our
participation has also provided us with the opportunity to measure the effectiveness of our
algorithms and techniques using large collections.

In Section 2, we present a brief survey of relevant work done by other researchers,
review our sense disambiguation technique, and describe our
term similarity based query expansion technique, as well as our rank-list merging
technique. Sections 3 and 4 discuss the experiments that we conducted to measure the
effectiveness of our techniques and their results. Finally, Section 5 concludes this
paper with a summary.



2 Dictionary-Based Approach 

The dictionary-based query translation approach translates each term in a query to 
another language by replacing it with the senses of that term in the dictionary. There 
are several problems in such translation techniques, mainly, problems with term 
ambiguity, phrase translation, and untranslated terms such as acronyms or technical 
terms that are not found in the dictionary. These problems result in very poor retrieval 
performance of the translation queries. 

A number of statistical and linguistic approaches have been demonstrated to be 
effective in alleviating the ambiguity problem. Ballesteros and Croft [4] use term co- 
occurrence data and part-of-speech tagging to reduce the ambiguity problem from the 
dictionary. A different approach is proposed by Pirkola [11] whose technique reduces 
the effect of the ambiguity problem by structuring the queries and translating them
using a general and a domain-specific dictionary. Translating phrases word-by-word
often results in the loss of the original meaning in the translation. In order to translate 
the phrase correctly, Hull and Grefenstette [8] used a phrase dictionary, which helps to 
improve the retrieval performance. Other researchers showed that a phrase dictionary 
built from a parallel corpus can also be used to recognize and handle phrases [6]. 

To further mitigate the negative effect of mistranslated query terms, many 
researchers have employed query expansion techniques. Query expansion is a well- 
known method in IR for improving retrieval performance. Basically, it adds new 
terms, selected using a certain technique, to the query such that the query becomes 
more precise where the added terms clarify the meaning of the original query terms, 
and its recall is improved as terms associated with the original query terms are added. 
Adriani and Croft [1] employ pseudo relevance feedback techniques to obtain terms 
for the query expansion. The pseudo relevance feedback techniques assume that the
top-ranked documents initially retrieved using the queries are relevant. Terms appearing
in these documents are then added to the queries. They found that
post-translation query expansion, i.e., query expansion on the translated queries, and the
combination-translation query expansion, i.e., query expansion on both the original 
and the translated queries, are effective in improving CLIR performance. Adriani and 
van Rijsbergen [2] expand the translated query based on the collective similarity 
between each candidate term and all of the existing terms in the query. 

Merging retrieval results from a number of collections in different languages was
addressed by many researchers in the CLIR task of the Text Retrieval Conference
(TREC) 1999. Oard et al. [10] compare the results of using a single index and
different indexes, but found no significant difference between the two types of
index. Other research groups, such as Braschler et al. [5] of Eurospider, apply a linear
regression analysis on parallel document alignments. Franz et al. [7] of IBM use a
probabilistic model to create a single rank list of multilingual documents.



2.1 Term Disambiguation Technique 

The sense ambiguity problem occurs in the process of translating queries from one 
language to another using the dictionary approach. In order to select the best 
translation terms from an entry in the dictionary, we apply our term disambiguation 
technique, which is based on the statistical similarity values among terms. This term 
disambiguation technique is based on our previous work [3]. Basically, given a set of
original query terms, we select for each term the best sense such that the resulting set
of selected senses contains senses that are mutually related, or statistically similar,
to one another. For computational cost considerations, this is done using an
approximate algorithm. Given a set of n original query terms $\{t_1, t_2, \ldots, t_n\}$, a set of
translation terms, T, is obtained using the following algorithm:

1. For each $t_i$ (i = 1 to n), retrieve a set of senses $S_i$ from the dictionary.

2. For each set $S_i$ (i = 1 to n), do steps 2.1, 2.2 and 2.3.

2.1 For each sense $t'_j$ (j = 1 to $|S_i|$) in $S_i$, do step 2.1.1.

2.1.1 For each set $S_k$ (k = 1 to n and $k \neq i$), get the maximum similarity, $M_{j,k}$, between $t'_j$
and the senses in $S_k$.

2.2 Compute the score of sense $t'_j$ as the sum of $M_{j,k}$ (k = 1 to n and $k \neq i$).

2.3 Select the sense in $S_i$ with the highest score, and add the selected sense into the set T.



Query terms that are not found in the dictionary are included in the translation set T 
as-is. This is typically the case for proper names, technical terms, and acronyms. A 
complete explanation of our technique can be found in [3]. 
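
The selection procedure above can be sketched in a few lines. This is an illustrative reading of the steps, not the code used in the experiments: the function name `disambiguate`, the toy dictionary and the toy similarity values are assumptions. The choice to let an untranslated term take part in the scoring as its own single sense is also an assumption.

```python
def disambiguate(query_terms, dictionary, sim):
    """Pick one translation per query term.

    dictionary: maps a source term to a list of candidate senses.
    sim:        function (a, b) -> similarity of two target-language terms.
    """
    # Step 1: sense sets; a term missing from the dictionary is kept as-is.
    sense_sets = [dictionary.get(term) or [term] for term in query_terms]
    translation = []
    for i, senses in enumerate(sense_sets):              # step 2
        def score(sense):
            # Steps 2.1.1 and 2.2: for every other sense set, take the best
            # match for this sense, then sum those maxima.
            return sum(
                max(sim(sense, other) for other in sense_sets[k])
                for k in range(len(sense_sets))
                if k != i
            )
        translation.append(max(senses, key=score))        # step 2.3
    return translation


# Toy data (hypothetical similarity values).
toy_dictionary = {"bank": ["Ufer", "Bank"], "money": ["Geld"]}
toy_pairs = {frozenset(("Bank", "Geld")): 0.8, frozenset(("Ufer", "Geld")): 0.1}


def toy_sim(a, b):
    return toy_pairs.get(frozenset((a, b)), 0.0)


selected = disambiguate(["bank", "money", "nato"], toy_dictionary, toy_sim)
```

Because each sense is scored only against the other sense sets, the cost grows with the product of the sense-set sizes rather than with the number of all possible sense combinations.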

We obtain the degree of similarity or association-relation between terms using a 
term association measure, called the Dice similarity coefficient [12], which is 
commonly used in document or term clustering. The term association measure, $sim_{xy}$,
between term x and y is computed as follows:

$$ sim_{xy} = \frac{2 \sum_{i=1}^{n} \left( w'_{xi} \cdot w'_{yi} \right)}{\sum_{i=1}^{n} w_{xi}^{2} + \sum_{i=1}^{n} w_{yi}^{2}} $$

where

$w_{xi}$ = the weight of term x in document i
$w_{yi}$ = the weight of term y in document i
$w'_{xi}$ = $w_{xi}$ if term y also occurs in document i, or 0 otherwise
$w'_{yi}$ = $w_{yi}$ if term x also occurs in document i, or 0 otherwise
n = the number of documents in the collection.

The weight of term x in document i is computed using the standard tf*idf term
weighting formula [13].
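
A small sketch of this measure follows. It assumes that the term weights have already been computed and are given as dictionaries from document identifier to weight, with an entry only for documents in which the term occurs. The function name `dice_similarity` and the sample weights are illustrative.

```python
def dice_similarity(wx, wy):
    """Dice coefficient between two terms.

    wx, wy: dicts mapping document id -> weight of term x (resp. y),
            containing only the documents in which the term occurs.
    """
    shared = wx.keys() & wy.keys()
    # The primed weights are zero outside the shared documents, so the
    # numerator only needs the documents that contain both terms.
    numerator = 2 * sum(wx[d] * wy[d] for d in shared)
    denominator = sum(w * w for w in wx.values()) + sum(w * w for w in wy.values())
    return numerator / denominator if denominator else 0.0


example = dice_similarity({1: 1.0, 2: 2.0}, {2: 2.0, 3: 1.0})
```

With these sample weights the numerator is 2 * (2.0 * 2.0) = 8 and the denominator is (1 + 4) + (4 + 1) = 10, so the similarity is 0.8. Two terms with identical weight vectors get a similarity of 1.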



2.2 Query Expansion Technique 

The resulting translated queries are, of course, worse than the original queries in
terms of their accuracy and retrieval effectiveness. We expand the translated queries
by adding related terms to further improve their retrieval performance.
Our query expansion technique also uses the Dice similarity coefficient to build a
similarity matrix containing the co-occurrences of the terms in document passages.
First, for each collection, we build a database that contains passages of 200 terms
each. We then run each query set to obtain the relevant passages. The top 20 passages
are then used for creating the term similarity matrix. Next, we compute the sum of
similarity values between each term in the passages and all terms in the query. Finally,
we add the top 10 terms from the relevant passages to the query.
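
The term selection step can be sketched as follows, assuming the top passages have already been retrieved and tokenized, and that a similarity function (for instance the Dice coefficient computed over those passages) is available. The name `expand_query` and the toy similarity values are assumptions made for this illustration.

```python
def expand_query(query_terms, top_passages, sim, n_terms=10):
    """Add the `n_terms` passage terms most similar to the whole query.

    top_passages: list of passages, each a list of terms.
    sim:          function (a, b) -> similarity between two terms.
    """
    query = list(query_terms)
    candidates = {t for passage in top_passages for t in passage} - set(query)
    # Score each candidate by its summed similarity to all query terms.
    scores = {c: sum(sim(c, q) for q in query) for c in candidates}
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(candidates, key=lambda c: (-scores[c], c))
    return query + ranked[:n_terms]


# Toy data (hypothetical similarity values).
toy_values = {("b", "a"): 0.5, ("c", "a"): 0.9, ("d", "a"): 0.1}
expanded = expand_query(
    ["a"],
    [["a", "b", "c"], ["c", "d"]],
    lambda x, y: toy_values.get((x, y), 0.0),
    n_terms=2,
)
```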
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2.3 Rank-List Merging 

The rank-list merging technique is required as we run the translation of each query in
the query set against each language collection, independently of the other language
collections. The retrieval results from the four language collections are then merged into
a single rank list. We employ a simple method based on the assumption that the
highest-ranked document in one language collection is comparable, in terms of its
relevance to the query, to that of another language collection. We realize that this assumption is
not always true but, owing to lack of time to experiment with other techniques, we
thought that it was a reasonable one. With this assumption, we normalize the
relevance scores for each collection by the highest score in that collection’s rank list,
and then merge and sort them into a single rank list.
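
A minimal sketch of this normalization and merging step is given below. The function name `merge_rank_lists`, the input layout and the sample scores are assumptions. The sketch also assumes that the top score in every non-empty rank list is positive.

```python
def merge_rank_lists(rank_lists):
    """Merge per-collection rank lists into one list.

    rank_lists: dict mapping collection name -> list of (doc_id, score).
    Each score is divided by the highest score of its own collection,
    so that the top document of every collection gets the value 1.0.
    """
    merged = []
    for results in rank_lists.values():
        if not results:
            continue
        top = max(score for _, score in results)
        merged.extend((doc_id, score / top) for doc_id, score in results)
    # Sort by normalized score, highest first (ties keep their input order).
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


merged = merge_rank_lists({
    "english": [("e1", 10.0), ("e2", 5.0)],
    "german": [("d1", 2.0), ("d2", 1.5)],
})
```

In this example the German documents move ahead of the second English document, because 1.5 / 2.0 = 0.75 exceeds 5.0 / 10.0 = 0.5. This shows how the method treats the top documents of all collections as equally relevant.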



3 Experiments 

In the multilingual track, the document collections are in four languages, namely, 
English, German, French, and Italian. The collections contain newspaper articles from 
Los Angeles Times (English), Frankfurter Rundschau and Der Spiegel (German), Le 
Monde (French), and La Stampa (Italian). We build the database for each collection 
using the INQUERY information retrieval system. 

From the multilingual query sets, we chose to run the English queries, which were
then translated using the online dictionaries. We used machine-readable dictionaries
downloaded from the Internet at http://www.freedict.com . These dictionaries contain
short translations of English terms into different languages. We realize that these
dictionaries are not ideal resources for our purpose, as most of the dictionary entries
contain only one or two senses. However, they are easily obtainable for free from the
Internet. We reformatted the dictionary files so that our query translator program could
read them.

The query translation process proceeds as follows. First, we remove all stop-words 
from the English query and obtain the root-words of the remaining terms using a 
Porter word stemmer. Each term is then substituted with its translation term or terms 
according to the dictionary, excluding any stopwords in the dictionary entries. A query 
phrase is translated by translating each of the phrase’s constituent terms. The 
translation terms are stemmed using the French and the German word stemmers from 
the PRISE retrieval system obtained from NIST. 
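
The term substitution step can be sketched as below. This is a simplified illustration rather than our translator program: the function name `translate_query` and the toy dictionary are assumptions. The stemmer is passed in as a parameter with the identity function as a placeholder, so neither the Porter stemmer nor the PRISE stemmers are reproduced here.

```python
def translate_query(query, dictionary, stopwords, stem=lambda word: word):
    """Translate a query term by term.

    dictionary: maps a (stemmed) source term to a list of translation terms.
    stopwords:  set of source- and target-language stop-words to drop.
    stem:       source-language stemmer; the identity is a placeholder.
    """
    terms = [stem(w) for w in query.lower().split() if w not in stopwords]
    translated = []
    for term in terms:
        # A term without a dictionary entry is carried over unchanged.
        senses = dictionary.get(term, [term])
        translated.extend(s for s in senses if s not in stopwords)
    return translated


toy_dictionary = {"world": ["monde"], "conference": ["conférence", "réunion"]}
french_query = translate_query("The world conference", toy_dictionary, {"the"})
```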

We then apply the term disambiguation technique to choose the best translation
term. The term similarity matrix is built for each collection, and we use the similarity
values to perform the term disambiguation. The resulting queries are then enhanced by
applying the query expansion technique, adding 10 terms from the 20 top-ranked
passages retrieved for the query. The values of 10 and 20 were obtained
through a preliminary experiment.

Finally, we run each query set on its respective document collection, including the
original English queries on the English collection, and the retrieval results from the
sets are then combined into a single document ranking.

In this experiment, we ran two query formats, namely, the title-only (short) and the 
long (full) query formats. Each query in the long query set contains a title, a 
description, and a narrative text of the CLEF query. We chose to do both query sets to 
see whether the results are consistent across both sets. All the steps in the multilingual 
task were done in a fully automatic manner. 



4 Results 

In this work, we participated in the multilingual task by running both the title-only and 
the long query formats. However, only the title-only query run was considered in the 
CLEF relevance assessment pool. 

As can be seen in Table 1, we obtained the best multilingual results, as compared to 
those of the equivalent monolingual runs, for the Italian translation queries. The 
French translation queries came second and, lastly, the German translation queries, 
which performed the poorest. Our investigation into the title-only query run revealed 
that the retrieval performance of each translation query correlates negatively with the 
number of original English terms that are not found in the bilingual dictionary. 
Specifically, our German translation query set contains 4 untranslated English terms 
and a number of stand-alone German terms in place of the 19 German compound 
nouns, which are the correct translations of the 19 English query terms. The French 
and the Italian query sets contain 13 and 23 untranslated English terms, respectively 
(see Table 2). 

Table 1. Average retrieval precision of the monolingual runs using the English, German,
French, and Italian queries; and the average precision of the cross-lingual runs and the
merged multilingual runs for English queries translated into German, French and Italian.
Both the title-only and the long query formats were used.

Query  Task            English  German  French  Italian  Merge
Title  Monolingual     0.2705   0.2075  0.2260  0.0347   -
Title  Cross Language  0.2705   0.0810  0.1097  0.0569   0.0560
Long   Monolingual     0.3804   0.2790  0.2682  0.1279   -
Long   Cross Language  0.3804   0.0932  0.1012  0.1050   0.0881

In our previous work [2], we obtained results where our German queries perform
better than the equivalent Spanish queries in retrieving documents from an English
collection. The reason is that most German compound words in our German query
set have exact English translations in the dictionary, unlike phrases in the Spanish
query set, which were translated word by word using a bilingual dictionary. In other
words, the degree of ambiguity of the German queries is less than that of the Spanish
queries. On the other hand, from this work, we learned that translating English queries
into German, which involves translating into compound words, is a difficult task.



Table 2. The number of English terms in the query set, and the number of them
that are not found respectively in the German, French, and Italian bilingual
dictionaries

Query  English  German  French  Italian
Title    114       4      13      23
Long   1,112      13      43      61



4.1 German Result 

The English queries that were translated into German (EG) consist of 2,714 terms for
the title-only queries and 27,239 terms for the long queries. Ideally, like the equivalent
German monolingual queries, the resulting translation queries should contain 125
terms and 1,811 terms for the title-only and the long queries, respectively.

The EG title-only queries perform 80.67% below the equivalent monolingual
German queries. Applying the term disambiguation technique improved the retrieval
performance by 24.39 percentage points. The EG long queries perform 89.82% below
the equivalent monolingual German queries. Similarly, the term
disambiguation technique improved the retrieval performance by 24.32 percentage points. However,
the query expansion technique hurt the retrieval performance by 4.70 and by 1.10 percentage points
for the title-only and the long queries, respectively (see Table 3a).
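
As a worked check of how the figures for the German title-only queries relate to each other, the sketch below recomputes the change of each run relative to the monolingual run from the average precision values in Table 3a. The helper name `relative_change` is an assumption. Because the published precisions are rounded to four digits, the recomputed percentages can differ from the printed ones in the last digit.

```python
def relative_change(monolingual, score):
    """Change of `score` relative to the monolingual run, in percent."""
    return 100 * (score - monolingual) / monolingual


# German title-only average precision values from Table 3a.
mono = 0.2075
stages = {"Trans": 0.0401, "Trans + Dis": 0.0907, "Trans + Dis + QE": 0.0810}

changes = {name: relative_change(mono, ap) for name, ap in stages.items()}

# Effect of each technique, as a difference between two relative changes.
gain_from_dis = changes["Trans + Dis"] - changes["Trans"]
loss_from_qe = changes["Trans + Dis"] - changes["Trans + Dis + QE"]
```

Here `gain_from_dis` comes out close to the reported 24.39 and `loss_from_qe` close to the reported 4.70, which suggests that both figures are differences between the percentages relative to the monolingual run.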



Table 3a. Average retrieval precision of the German monolingual queries, the
German translation of the equivalent English queries, the translation queries after
applying the term-disambiguation technique, and the translation queries after
applying the term disambiguation and query expansion techniques

Query                    Title             Long
Monolingual              0.2075            0.2790
Trans (EG)               0.0401 (-80.67%)  0.0284 (-89.82%)
Trans (EG) + Dis         0.0907 (-56.28%)  0.0962 (-65.50%)
Trans (EG) + Dis + QE    0.0810 (-60.98%)  0.0932 (-66.60%)

4.2 French Result

The English queries that were translated into French (EF) consist of 552 terms for the
title-only queries and 8,292 terms for the long queries. Ideally, like the equivalent
French monolingual queries, the resulting translation queries should contain 198 terms
and 2,324 terms for the title-only and the long queries, respectively.

The EF title-only queries perform 66.80% below the equivalent monolingual
French queries. Applying the term disambiguation technique improved the retrieval
performance by 15.32 percentage points. The EF long queries perform 89.44% below
the equivalent monolingual French queries. The term disambiguation
technique improved the retrieval performance by 27.17 percentage points. As with the German
translation queries, the query expansion technique hurt the retrieval performance, by
18.19 and by 1.54 percentage points for the title-only and the long queries, respectively (see Table
3b).



Table 3b. Average retrieval precision of the French monolingual queries, the
French translation of the equivalent English queries, the translation queries after
applying the term-disambiguation technique, and the translation queries after
applying the term disambiguation and query expansion techniques

Query                    Title             Long
Monolingual              0.2260            0.2682
Trans (EF)               0.0750 (-66.80%)  0.0283 (-89.44%)
Trans (EF) + Dis         0.1097 (-51.48%)  0.1012 (-62.27%)
Trans (EF) + Dis + QE    0.2682 (-69.67%)  0.0971 (-63.81%)



4.3 Italian Result 

The English queries that were translated into Italian (EI) consist of 362 terms for the
title-only queries and 3,259 terms for the long queries. Ideally, like the equivalent
Italian monolingual queries, the resulting translation queries should contain 173 terms
and 2,172 terms for the title-only and the long queries, respectively.

The EI title-only queries perform 57.91% above the equivalent monolingual Italian
queries. Applying the term disambiguation technique improved the retrieval
performance by 6.1%. The EI long queries perform 36.47% below
the equivalent monolingual Italian queries. The term disambiguation
technique improved the retrieval performance by 18.53%. As with the previous
languages, the query expansion technique hurt the retrieval performance by 64.14%
and by 9.78% for the title-only and the long queries, respectively (see Table 3c).

Table 3c. Average retrieval precision of the Italian monolingual queries, the Italian
translation of the equivalent English queries, the translation queries after applying
the term-disambiguation technique, and the translation queries after applying the
term disambiguation and query expansion techniques

Query                    Title             Long
Monolingual              0.0347            0.1279
Trans (EI)               0.0548 (+57.91%)  0.0813 (-36.47%)
Trans (EI) + Dis         0.0569 (+64.01%)  0.1050 (-17.94%)
Trans (EI) + Dis + QE    0.0204 (-41.21%)  0.0925 (-27.72%)



Overall, applying the term disambiguation technique improved the retrieval
performance of the translation queries in the three languages by 6%-24%. However,
the query expansion technique did not help improve the retrieval performance and
instead made it worse by 1%-64%: it expands
translation queries that are incorrect in the first place, and thus adds terms that are not
relevant to the original queries. The major cause of the poor translations is the fact
that many terms could not be found in the bilingual dictionaries. We
hope that next time we will be able to use better machine-readable
dictionaries.

From the results for each language, we learned that the translated queries for each
language performed 27%-69% below the equivalent monolingual queries. Our rank-list
merging algorithm assumes that the most relevant document from the monolingual
retrieval in English is as relevant as that from any of the cross-lingual retrievals in the
other languages. Since this assumption was not true in most of the cases, the resulting
merged rank lists contain a relatively large number of irrelevant documents, as
compared to the number of relevant ones. We plan to use a better rank-list merging
algorithm in the future.



5 Summary 

The field of cross-language information retrieval (CLIR) research still poses many
challenges to be solved by its researchers. Work has been done to demonstrate that the
sense ambiguity and phrase translation problems in the translation process can be
solved using statistical or linguistic approaches. Moreover, to deal with different
languages, one needs to take into consideration word-formation patterns specific to
each language, such as compound word forms in German. Another main research issue
is the merging of retrieval results from multilingual document collections. Finally, for
a dictionary-based CLIR query translation to be effective, it requires a comprehensive
and good quality dictionary.

Abstract. The application of NLP techniques to improve the results of
information retrieval is still considered a controversial issue, whereas
in cross-language information retrieval (CLIR) linguistic processing is already
well established. In this paper, we present the CLIR component Mpro-IR,
which has been developed as the core module of a multilingual
information system in a legal domain. This component uses not
only the lexical base form for indexing but also derivational information
and, for German, information about the decomposition of compounds.
This information is provided by a sophisticated morpho-syntactic analyser
and is exploited not only for query translation but also for query
expansion as well as the search and the document ranking. The objective
of the CLEF evaluation was to assess this linguistics-based retrieval approach
in an unrestricted domain. The focus of the investigation was on
how derivation and decomposition can contribute to improving the recall.



1 Introduction 

The Mpro-IR system is a clir system based on query translation and focuses 
rather on a better recall than on a balanced recall and precision. To improve the 
recall, the system tries to take advantage of a sophisticated linguistic processing 
component whose results are used in the monolingual retrieval modules. Based 
on the output of a morpho-syntactic analysis which provides the full range of
morphological information, the system exploits not only inflection (which would
correspond to the power of a stemmer such as the Porter stemmer) but also derivation
and the decomposition of compound nouns. This information is used for the
indexing, query expansion, search and document ranking. The translation com- 
ponent takes additional advantage of the part-of-speech provided as well as of 
the syntactic structure of the source query. Section 2 gives a short overview on 
how this information is obtained and exploited in the system. 

For CLEF 2000, our first evaluation within the TREC framework, we did
one official run mainly to test Mpro-IR in an unrestricted domain. We carried 
out the retrieval by querying only the English title section of the topics and using 
a retrieval component especially developed for phrase search in a legal domain, 
i.e. the whole phrase has to occur in the same sentence. But the main aim 
was to investigate whether derivational information and decomposition of nouns 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 1 ( 16- 17771 2001. 
could contribute to a better recall. As discussed in Section 3, the restrictions of
Mpro-IR’s phrase search are too strong to obtain a satisfactory performance. 
They do not even allow a final conclusion as to whether the application of the 
additional linguistic information improves the recall or not. 



2 Mpro-IR System Description 

The CLIR component Mpro-IR has been developed as the core component of 
a multilingual web-based information system on European Media Laws (EMIS).
The document base is multilingual: there are documents in German, English, 
and French. For these languages, an interface is available that enables the users 
to enter their queries in the selected language. The design of Mpro-IR is guided
by the requirements that an information retrieval system in a legal
domain has to satisfy: it has to support the lawyers’ work, which means finding
as much information as possible about a certain subject. In terms of IR,
the retrieval component should provide the best possible recall. The design of 
the system also had to take into account that the domain is relatively new and 
neither a thesaurus nor an approved term list is available, thus queries using 
an uncontrolled vocabulary are usual. In addition, the type of queries used has 
some impact on the design: the system has to be capable of processing single 
word queries such as advertising, compound terms such as subliminal advertising, as
well as complex phrases like actions leading to competition distortions, private 
broadcasters’ obligation to provide information, ... In the legal domain, such 
phrases often have to occur within one sentence to be relevant, therefore a spe- 
cial phrase search component has been developed which searches the input query 
within this restricted space. However, to allow the search of each of the meaning 
bearing terms within a whole document, a traditional Boolean search facility is 
also provided to the users. 

Independently of the search facility used, the input query as well as the doc- 
uments undergo a linguistic processing to take advantage of the information 
provided. 

The Linguistic Processing 

Stemming is the NLP technique which is frequently used and successfully applied 
in IR systems. A standard tool is the Porter stemmer |Zj which achieves a normal- 
isation by simply chopping off suffixes. Such stemmers have serious deficiencies, 
for instance general is mapped to gener, and distribute to distribut, neither of
which is a lexical base form, and this leads to improper conflations. To overcome
some of these problems, advanced stemmers are developed and combined
with a lexicon ^ to verify the identified stem. This approach produces far better 
results. It avoids the type of error shown above but others, such as the mapping 
of distributed to distribut, still occur. In this case, the word distributed cannot be 
found in the dictionary. Irregular plurals (media/medium) or inflected forms
(went/go) also cause errors. The main drawback of this approach thus lies in the
coverage of the lexicon. 
For languages with a rich declensional morphology such as French or German,
the results of such stemming are rather unsatisfying because considering
only inflection (or even suffix reduction) is not enough (cf. u,m- For instance, 
the stemming of the German past participle gegangen (gone) to gang results in 
a wrong form (the correct one is gehen, to go). German verbs as well as French
verbs such as aller (to go) or recevoir (to get) have numerous forms which makes
it almost impossible to stem them by using suffix algorithms. For German, in 
addition, the compound formation leads often to failures because of the under- 
lying highly productive morphological process (cf. [^). 

In Mpro-IR, the Mpro programme package 0 developed at IAI is used for
the linguistic processing, and its major features will be described in the following. 
Mpro has been primarily developed to process the German language but is now 
available for different languages (including Eastern European languages). How- 
ever, the same level of functionality as the German module is not available for all 
language modules. Mpro performs a morpho-syntactic analysis consisting of a 
lemmatisation, a part-of-speech tagging and, for German, a compound analysis 
as well as, optionally, an additional syntactic and semantic disambiguation eval- 
uating mainly context information. For the reduction of syntactic ambiguities, 
there is also a shallow parsing component available for each language. 

The morpho-syntactic analysis is combined with a look-up in a word-form 
dictionary. In a first step, the word-forms are looked up in a special tagging 
dictionary, for which an entry looks as follows: 

{string=Word-form,c=w,sc=CAT,lu=Citation-form,...}

where CAT is the category. Nouns, verbs, adjectives, and derived adverbs are 
looked up in a morpheme lexicon. This morphological dictionary contains allomorphs
but also some irregular word-forms which cannot be identified in another
way as well as a variety of toponyms and other names. Each entry shows how the
associated stems behave morphologically, as shown in the examples below: 

{string=corrupt,c=a,n={ness=quality}}

{string=corrupt,c=v,n={ion=massnahme},a={ible=able},
t={c=v,double=no,end=s,funct=no}}

To reduce overgeneration we can also prohibit prefixes or certain nonsensical 
compounds. 

For each word-form, the morphological analyser produces at least one description
which is represented as a set of attribute-value pairs. In the following, the analyses of
the English noun corruption, the verb corrupt, and adjective corrupting are given 
(only the features of interest are shown) : 



{string=corruption,lu=corruption,ds=corrupt~ion,ts=corruption,
ls=corrupt,t=corruption,c=noun,s=massnahme,...}

{string=corrupt,lu=corrupt,ds=corrupt,ts=corrupt,ls=corrupt,
t=corrupt,c=adj,...}

{string=corrupt,lu=corrupt,ds=corrupt,ts=corrupt,ls=corrupt,
t=corrupt,c=verb,...}

{string=corrupting,lu=corrupting,ds=corrupt~ing,ts=corrupting,
ls=corrupt,t=corrupting,c=noun,s=vn,...}

{string=corrupting,lu=corrupting,ds=corrupt~ing,ts=corrupting,
ls=corrupt,t=corrupting,c=adj,...}

{string=corrupting,lu=corrupt,ds=corrupt,ts=corrupt,ls=corrupt,
t=corrupt,c=verb,...}

The feature ds contains the morphological derivation, and ls the respective
normalised form. The features s and ss (for compounds) contain semantic
information. In the example above, all three words share the same normalised
derivation (ls=corrupt). For German
words, a compound analysis is performed additionally (cf. example below), and
the result is given in the feature ts and its normalised form^ in feature t. These
features are also assigned for English and French analyses but always correspond
to the lu feature.

Due to a special treatment, some defective noun constructions in German
- such as those occurring in coordinations like Informations- und Kommunikationsdienst
(information and communications services) - are recognised. Mpro
assigns the missing head information by using a lookahead algorithm:



{string=Informations-,lu=informationsdienst,ts=informations#dienst,
t=information#dienst,ds=informieren~ation#dienst,
ls=informieren#dienst,c=noun,...}

{string=und,lu=und,c=w,...}

{string=Kommunikationsdienst,lu=kommunikationsdienst,
ts=kommunikations#dienst,t=kommunikation#dienst,c=noun,...}

Although Mpro is very complete, a strategy for handling unknown words is 
provided. Three cases can be differentiated: 

— The word-form cannot be analysed at all:

Mpro marks this word with the feature state=unknown and classifies the 
word as ’noun’, for instance 

{string=settlor,lu=settlor,ds=settlor,state=unknown,c=noun,s=n}

— The word-form can partly be analysed: 

Mpro tries in each case to assign the most appropriate information. For
instance, if a string consists only of digits, such as 1894, the word is assigned
the category cardinal number (c=z), and Mpro provides an analysis in which
the value of the lexical unit is identical to the string:

{string=1894,ds=1894,ls=1894,c=z,lu=1894,s=year}

— The word form is analysed but not found in the lexicon: 

Strings which consist only of capital letters such as CNN are marked as
acronyms, and have as the part-of-speech c=noun:

{string=CNN,lu=CNN,ds=CNN,ls=CNN,c=noun,s=acronym}

^ Hyphens and German ’fuge’ elements are removed.

170 Barbel Ripplinger

The analyser recognises lexicalised multiword units such as look up, United 
States, German prefix verbs, for instance mitteilen, fixed expressions such as in
Bezug auf, de facto, abbreviations like etc., i.e. as well as proper names such as 
Bill, Berlin. 

After this analysis, for German the output can be further disambiguated by evaluating
context information, e.g. if the first letter of the word-form is capitalised,
and the word is not the first in a sentence, it must be a noun. In a final step, a 
shallow parsing can be applied to reduce other syntactical ambiguities such as 
verb/noun readings. This parsing process can also be performed for the English 
and French output of the morphological analysis to get an almost unambiguous 
representation. Mpro does not reduce ambiguity where the correctness of the 
decision is doubtful. 

The remainder of this section describes how the results of the morpho-syntactic 
analysis are applied for various stages of the IR process. 

The Retrieval 

For indexing, query expansion, and the search together with a document ranking, 
the information provided by the features lu, ls as well as t (currently for German
only) is exploited. 

Based on the analyses of the documents, several indices are built: one using 
the information about the lexical unit (i.e. the normalised form), one using the 
derivational information, and for German a third index is constructed with the 
decomposition information. Though English and French nouns have a t-feature, 
we have not exploited this kind of information because this information is subject 
to an ongoing revision of the English and French morpheme lexicon (see above) . 
With each key, the document identification number, the sentence number (snr), 
the word number (wnr), as well as the word-form (the form of the word as
occurring in the text) are stored. Function words (entries with c=w) are discarded
from the indexing. This process is done within a preparation phase. 
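
The index construction described above can be illustrated with a minimal sketch. It is not the Mpro-IR implementation: the token dictionaries, the `Posting` record, the function name `build_indices`, and the choice to key the t-index on the '#'-separated compound parts are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: str
    snr: int        # sentence number
    wnr: int        # word number
    word_form: str  # surface form as it occurs in the text


def build_indices(analysed_docs):
    """Build lu-, ls- and t-indices from analysed documents.

    analysed_docs maps a document id to a list of token dicts; this
    layout and the splitting of t on '#' are assumptions of the sketch.
    """
    indices = {name: defaultdict(list) for name in ("lu", "ls", "t")}
    for doc_id, tokens in analysed_docs.items():
        for tok in tokens:
            if tok["c"] == "w":  # function words are discarded
                continue
            posting = Posting(doc_id, tok["snr"], tok["wnr"], tok["string"])
            indices["lu"][tok["lu"]].append(posting)
            indices["ls"][tok["ls"]].append(posting)
            # The t-feature is only exploited for German compounds.
            for part in tok.get("t", "").split("#"):
                if part:
                    indices["t"][part].append(posting)
    return indices
```

Storing the sentence and word numbers with every key is what later allows a search to be restricted to one sentence or to a fixed word distance.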

At search time, the queries are processed by the same morpho-syntactic analysis 
as the documents. For the monolingual search, the function words are removed 
from the analysis output and, for the meaning bearing words, the values of the 
lu-, ls- and, for German queries, the t-feature are extracted to construct a set
of search patterns. For the input query Competitiveness of European industry,
the set of search terms consists of competitiveness, compete, european, europe, 
industry. 
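
As a rough illustration of this step, the following sketch derives the search terms from an analysed query. The feature values in the hand-written analysis are assumptions chosen to reproduce the term set quoted above; a real analysis would come from Mpro, and `search_terms` is a hypothetical name.

```python
def search_terms(analysis, german=False):
    """Collect lu-, ls- (and, for German, t-) values of content words."""
    features = ["lu", "ls"] + (["t"] if german else [])
    terms = []
    for tok in analysis:
        if tok["c"] == "w":  # drop function words
            continue
        for feat in features:
            value = tok.get(feat)
            if value and value not in terms:  # keep first occurrence only
                terms.append(value)
    return terms


# Hand-written stand-in for the analysis of the example query.
query = [
    {"string": "Competitiveness", "c": "noun", "lu": "competitiveness", "ls": "compete"},
    {"string": "of", "c": "w", "lu": "of", "ls": "of"},
    {"string": "European", "c": "adj", "lu": "european", "ls": "europe"},
    {"string": "industry", "c": "noun", "lu": "industry", "ls": "industry"},
]
print(search_terms(query))
```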

For the cross-language retrieval, we decided to translate the queries and to carry
out a monolingual search afterwards. This approach seems more appropriate because
legal information is highly related to the original wording, and machine
translation systems provide only poor quality [2]. The input to the translation
component is the complete morphological analysis of the query. Mpro-IR
uses a shallow translation tool which performs a lexical transfer based on huge
transfer lexicons (coverage of the English-German lexicon is about 500,000 entries)
comprising single words, abbreviations, compound terms but also fixed
phrases. For multiword units, the MT component first looks up whether the dictionary
contains a translation for the whole phrase. If no translation exists, the
phrase is translated compositionally, with the translation guided by the
part-of-speech, i.e. for verbs only the translations for verbs are assigned. The
translation output undergoes a shallow parsing based on a phrase grammar to
obtain only one possible translation, taking the syntactic representation of the
source into account. For German as target language, the syntactic variants
of a term are additionally sorted out. For example, there are two entries in
the English-German dictionary for human dignity, Menschenwürde and Würde
des Menschen. In these cases, the compound is preferred because, due to the
query expansion, all occurrences of the syntactic variant Würde des Menschen
are equally found but the search for a compound is much faster than that for a
phrase.
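
The lookup order described above (whole phrase first, then word-by-word transfer guided by part-of-speech, with compounds preferred for German) can be sketched as follows. This is a simplification under stated assumptions: the two lexicons are toy dictionaries, the selection of a single translation by shallow parsing is replaced by simply taking the first candidate, and keeping untranslatable words unchanged is an assumption rather than documented behaviour.

```python
def translate(tokens, phrase_lexicon, word_lexicon):
    """Translate an analysed query; tokens carry 'lu' and category 'c'."""
    phrase = " ".join(tok["lu"] for tok in tokens)
    if phrase in phrase_lexicon:
        candidates = phrase_lexicon[phrase]
        # Prefer a one-word compound over a syntactic variant.
        compounds = [c for c in candidates if " " not in c]
        return [compounds[0] if compounds else candidates[0]]
    translated = []
    for tok in tokens:
        # Only translations with the same part-of-speech are considered.
        options = word_lexicon.get((tok["lu"], tok["c"]), [])
        translated.append(options[0] if options else tok["lu"])
    return translated


# Toy lexicon entries, for illustration only.
phrases = {"human dignity": ["Würde des Menschen", "Menschenwürde"]}
words = {("display", "verb"): ["anzeigen"], ("display", "noun"): ["Anzeige"]}
```

With these toy entries, the phrase human dignity is resolved to the compound, while display is translated differently depending on whether it was analysed as a verb or as a noun.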

The search itself consists of several look-ups in the different indices; for each 
content bearing term the following look-ups are made: 

1. Looking up the index built over the lexical base forms (lu-index) with the 
value of the lu-feature 

2. For German only: Looking up the index built over the t-feature (t-index) 
with the value of the t-feature to find compounds with the queried term as 
element 

3. Looking up the index built over the derivations (ls-index) with the value of
the ls-feature
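
The three look-ups listed above can be sketched for a single content-bearing term as follows. The index layout (plain dictionaries mapping a key to a list of postings) and the name `look_up` are assumptions of this sketch.

```python
def look_up(term, indices, german=False):
    """Return the postings found in each index for one query term."""
    hits = {"lu": indices["lu"].get(term["lu"], [])}
    if german and term.get("t"):
        # Finds compounds that contain the queried term as an element.
        hits["t"] = indices["t"].get(term["t"], [])
    hits["ls"] = indices["ls"].get(term["ls"], [])
    return hits


# Toy indices; postings are (doc_id, sentence number, word number).
toy_indices = {
    "lu": {"advertising": [("d1", 1, 2)]},
    "ls": {"advertise": [("d1", 1, 2), ("d2", 3, 5)]},
    "t": {},
}
```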

For compounds, the different formation in English and French compared to Ger- 
man leads to a different search strategy. Bearing in mind that open compound 
terms in English and French have almost a fixed word order, we defined a dis- 
tance factor to decide whether the occurrence of two or more words represents an 
open compound or not. Based on statistical data, the maximum distance between
the meaning-bearing words of a phrase is set to 3. This allows us to classify
occurrences of advertising in UK’s television as exact hits of television
advertising. For English as well as for French compounds, the occurrences of each word
within a phrase are evaluated against this distance factor using the word number
provided by the index, and sorted into the following three lists:

1. The lu-values looked up in the lu-index of each element occur within the
determined distance. 

2. At least for one element only the derivation occurs within this distance. 

3. All other occurrences. 
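
The sorting into three lists can be sketched as below. The sketch assumes that the distance factor applies to neighbouring matched words once they are ordered by word number; the text does not spell this out, so this reading, like the names `within_distance` and `classify`, is an assumption.

```python
MAX_DISTANCE = 3  # distance factor given in the text


def within_distance(positions, max_distance=MAX_DISTANCE):
    """True if neighbouring matched words are at most max_distance apart."""
    ordered = sorted(positions)
    return all(b - a <= max_distance for a, b in zip(ordered, ordered[1:]))


def classify(matches):
    """matches: one (word number, matching feature) pair per phrase element.

    Returns 1 for lu-matches within the distance, 2 if at least one
    element matched only through its derivation (ls), 3 otherwise.
    """
    if not within_distance([wnr for wnr, _ in matches]):
        return 3
    if all(feature == "lu" for _, feature in matches):
        return 1
    return 2
```

For the example in the text, advertising at word 1 and television at word 4 of "advertising in UK's television" are three words apart, so the occurrence lands in the first list.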

We apply this distance measure also to German to find syntactic variants of 
compound terms: 
1. Looking up the lu-index with the values of the t- and ls-features of the
single compound elements. This retrieves documents containing the syntactic
variants of the input compound, for instance searching for Verbraucherschutz
(consumer protection) hits zum Schutz der Verbraucher as well as um die
Verbraucher zu schützen.

2. Looking up the lu-index with the values of the t- and ls-features, where the
parts of the compound occur outside this distance.

3. Looking up the ls-index with the values of the t- and ls-features of the
compound parts.

This produces a list of documents containing semantically similar terms. 
These are terms which point to a common concept in a virtual hierarchy 
(i.e. all elements of the ’transitive closure’ of the particular concept denoted 
by the compound). For instance, the search for Verbraucherschutz found hits
such as Schutzbestimmungen bezüglich der Verbraucherdaten (regulation to
protect consumer data). 

For phrases, the topmost result list consists of documents which contain the 
elements of the phrase exactly (excluding function words). The next list con- 
tains documents in which at least one phrase element occurs only as part of a 
compound. All further result lists are analogously calculated. 

Usually the rank of a retrieved document is computed using tf*idf. Using
a weight based on frequency seems inadequate in this environment of a legal
domain in which some terms occur only as parts of bigger compounds, or in
different parts-of-speech. Thus, in Mpro-IR, the documents are ranked by the
information used to retrieve them, in the order of the lists described above. This
ranking mirrors the reliability of the linguistic information
used to retrieve a document: a document retrieved by stem information is
considered more relevant to the query than a document retrieved by derivational
information. The ranking thus expresses the expected precision of the retrieval. The results
of the first list have a higher precision than those of the lower lists because the
probability that mismatched documents are retrieved increases from list to list.
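
Ranking by list order rather than by a frequency weight amounts to concatenating the result lists and keeping each document at its first, most reliable position. A minimal sketch, in which the function name `rank_by_list_order` is hypothetical:

```python
def rank_by_list_order(result_lists):
    """Concatenate result lists in order of reliability, without repeats."""
    seen = set()
    ranking = []
    for documents in result_lists:
        for doc_id in documents:
            if doc_id not in seen:
                seen.add(doc_id)
                ranking.append(doc_id)
    return ranking
```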



3 Mpro-IR in CLEF 

We participated for the first time in a CLEF/TREC evaluation to investigate how
Mpro-IR, developed for a special domain, fares on unrestricted documents with
respect to recall and precision.

Setting up the Experiment 

Currently the Mpro-IR system covers only German, English, and French. To 
perform CLEF’s CLIR task, which additionally comprises the search in Italian
documents, we integrated a small Italian component into Mpro-IR. To provide a
sufficient coverage for this module, we analysed the complete Italian topics (titles,
descriptions, and narratives), and added unknown words (morphemes) to
our monolingual lexicon. For the translation component, we added only translations
for the words occurring in the title sections of the topics. Thus the Italian
morpheme lexicon has now 27,800 entries compared, for instance, to the English
morpheme lexicon with about 48,300 entries. We used English topics and retrieved
documents in English, French, German, and Italian; therefore we added
missing translations for the terms of the topic titles to the respective transfer 
dictionaries. 

Retrieval Performance 

Due to time and space restrictions we could perform and submit only one run. 
Therefore we decided to perform a phrase search only over the title sections of
the topics, although we noticed that the type of queries was not always adequate 
for this kind of search. To build up the indices, texts were normalised, i.e. we 
discarded all header and other formatting information including some of the 
title sections which led in some cases to a lower performance due to missing text 
parts. 

The overall result of the CLEF evaluation shows a low retrieval performance
of Mpro-IR compared to the other systems. Taking into account that a very
restricted retrieval component has been used - all meaning bearing words have to
occur in the same sentence, and only one translation is used - the outcome is not
too bad. The results show more or less what we expected: For topics which are
incomplete sentences such as French conscientious objector, supermarket ceiling
in Nice collapses, etc. we got no or only a few results (cf. Figure 1 lAil).




For topics such as European Economic Area, World Trade Organisation etc. the 
results are better though not satisfactory. 
Our main objective was to evaluate the use of derivational and decompositional
information to improve the recall. Thus, we could conclude that most of the 
documents are retrieved by using the information of the lexical base form. Only 
a few others are retrieved on the basis of derivational information. Decomposi- 
tion information which is only used for retrieving German documents depends 
on the type of compounds, and in a few cases also on the type of the single 
words forming a compound. No relevant occurrences of syntactic variants are 
found in the corpus. We also got only a few results on the basis of the produc- 
tive use of decomposition information, i.e. documents containing semantically 
similar terms. The main reason is certainly the restricted search space; furthermore,
the German compounds occurring in the queries (such as Kriegsdienstverweigerer,
Krebsgenetik, Golfskriegssyndrom, Nobelpreis, Alkoholkonsum, ...)
consist of words which are not frequently used in compound formation within
the context of the respective query. Another reason is that only one translation
is used (e.g. Methane deposit is translated into German as Methanlagerstätte
whereas in the documents the synonym Methanlager is often used).

To get an impression to what extent the restriction to a sentence as search 
space is too strong, we performed a second unofficial run. The result (iAi2 in
the figure above) shows an overall improvement in average precision of 50%,
and an almost three times higher recall (425 vs. 1168 relevant documents). We
also obtained more hits using decomposition and derivation information. There
are also some relevant documents found on the basis of semantically similar terms.

4 Conclusion 

The results of the clef evaluation correspond with those we got from the eval- 
uation of the retrieval algorithm within the emis system mg. Also here most 
hits could be retrieved by using precise lexical base forms and derivational
information. Compositional information was also valuable for detecting syntactic
variants of German compounds. The improvement of the recall by so-called
semantically similar terms is very poor. Because this approach is also very time
consuming, we will defer this in favour of a better morpho-syntactic analysis. 
This will then provide the basis for a better indexing by using a term recogni- 
tion component, and a better translation component. 

For the query expansion on the monolingual side, we are currently experimenting with
a method to add synonyms which will be automatically computed by translating 
the translations back to the source language. The search itself could be improved 
by taking advantage of the part-of-speech together with the semantic information 
already provided by the morpho-syntactic analyser j^. 

As the results here show, the phrase search as implemented in Mpro-IR is use- 
ful in retrieval systems developed for a special type of domain where the search 
of complex phrases is necessary, such as the legal domain. In retrieval systems 
dealing with unrestricted texts, a Boolean search achieves much better recall. As 
the unofficial run shows, with a Boolean search we could certainly get a better
insight into the usefulness of derivational and compositional information in the 
retrieval process due to the higher recall. Additionally, there is some potential
to improve the precision, which we have neglected so far in favour of a high recall,
by exploiting number and case agreement, for instance.

The approach we pursue in Mpro-IR using a sophisticated morpho-syntactic 
analysis has shown that the recall can be improved by more precise identifica- 
tion of the lexical base units and the almost unambiguous representation of the 
documents and the queries. The possible impact of derivational and decompositional
information has to be further evaluated; the results from the CLEF experiment
are not yet conclusive. However, part-of-speech, currently exploited only for
translation purposes together with semantic information, can be expected to
contribute to a better retrieval performance, which still has to be shown.
Abstract. The University of Maryland participated in the CLEF 2000
multilingual task, submitting three official runs that explored the impact 
of applying language-independent stemming techniques to dictionary- 
based cross-language information retrieval. The paper begins by describ- 
ing a cross-language information retrieval architecture based on balanced 
document translation. A four-stage backoff strategy for improving the 
coverage of dictionary-based translation techniques is then introduced, 
and an implementation based on automatically trained statistical stem- 
ming is presented. Results indicate that competitive performance can be 
achieved using four-stage backoff translation in conjunction with freely 
available bilingual dictionaries, but that the usefulness of the statistical
stemming algorithms that were tried varies considerably across the
three languages to which they were applied. 



1 Introduction 

One important goal of our research is to develop cross-language information 
retrieval (CLIR) techniques that can be applied to new language pairs with 
minimal language-specific tuning. So-called “dictionary-based” techniques offer 
promise in this regard because bilingual dictionaries have proven to be a useful 
basis for CLIR jS| and because simple bilingual dictionaries are becoming widely 
available on the Internet. Although bilingual dictionaries sometimes include useful
information such as part-of-speech, morphology and translation preference,
it is far more common to find a simple list of translation equivalent term pairs — 
what we refer to as a “bilingual term list.” The objective of our participation 
in the Cross-Language Evaluation Forum (CLEF) was to explore techniques for 
dictionary-based CLIR using bilingual term lists between English and other Eu- 
ropean languages. We applied techniques that we have used before (balanced 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 176-ESl 2001. 
document translation, described below), and chose to focus our contrastive runs
on improving translation coverage using morphological analysis and an unsuper- 
vised morphological analysis approach that we refer to as “statistical stemming.” 
In the next section we describe our balanced document translation architecture, 
explain how morphological analysis can be used to improve translation cover- 
age without additional language-specific resources, and introduce two statistical 
stemming algorithms. The following section presents our CLEF results, which 
demonstrate that the additional coverage achieved by four-stage backoff transla- 
tion can have a substantial beneficial effect on retrieval effectiveness as measured 
by mean average precision, but that our present statistical stemming algorithms 
perform well only in French. In the final section we draw some conclusions regard- 
ing the broader utility of our techniques and suggest some additional research 
directions. 

2 Experiment Design 

We chose to participate in the multilingual task of CLEF 2000 because the 
structure of the task (English queries, documents in other languages) was well 
matched to a CLIR architecture based on document translation that we have 
been developing. Document translation is an attractive approach in interactive 
applications if all queries are in a single language because the pre-translated doc- 
uments that are retrieved can immediately be examined by the user. Although 
storage overhead is doubled (if the documents are also stored in their original 
language), that may be of little consequence in an era of rapidly falling disk 
prices. The principal challenge in a document translation architecture is to bal- 
ance the translation speed and translation accuracy. In our initial experiments 
with document translation, we found that a commercial machine translation 
system required about 10 machine-months to translate approximately 250,000 
documents - resource requirements that would clearly be impractical in many 
applications |S|. With simpler techniques, such as looking up each word in a 
bilingual term list, we can translate a similar number of documents in only 
three machine-hours — a period of time comparable to that required to build an 
inverted index. In our CLEF experiments we have thus chosen to focus on im- 
proving the retrieval effectiveness of dictionary-based CLIR without introducing 
a significant adverse effect on translation efficiency. 

Figure 1 illustrates our overall CLIR system design. Each non-English collection
was processed separately using the appropriate bilingual term list. We
grouped the articles from Der Spiegel and Frankfurter Rundschau into a single 
German collection and formed a French collection from the Le Monde articles 
and an Italian collection from the La Stampa articles. The documents were nor- 
malized by mapping all characters to lower case 7-bit ASCII through removal 
of accents. Term-by-term translation was then performed, applying a four-stage 
backoff statistical stemming approach to enhance translation coverage. For trans- 
lation, we tokenized source-language terms at white space or terminal punctua- 
tion (which had the effect of ignoring all source-language multiword expressions 
in our bilingual term lists). When no translation was known for a clitic contraction,
automatic expansion was performed (e.g. l’heure -> la heure) and the resulting
words were translated separately.^ Other words with no known translation
were retained unchanged, which is often appropriate for proper names. We produced
exactly two English terms for each source-language term. For terms with
no known translation, the untranslated term was generated twice. For terms with 
one known translation, that translation was generated twice. Terms with two or 
more known translations resulted in generation of each of the “best” two trans- 
lations once. In prior experiments we have found that this strategy, known as 
“balanced translation,” outperforms the still fairly common (unbalanced) tech- 
nique of including all known translations because it avoids overweighting terms 
that have many translations (which are often quite common, and hence less 
useful as search terms) 0. 
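
Two of the steps described above, character normalization and balanced translation, can be sketched in a few lines. This is an illustration, not the code used for the official runs: the function names and the toy term list are hypothetical, the term list is assumed to hold translations already ordered best first, and clitic expansion and the backoff stages are left out.

```python
import unicodedata


def normalize(text):
    """Lower-case and strip accents to obtain 7-bit ASCII.

    Characters without an ASCII decomposition (for instance the German
    sharp s) are simply dropped here; the original handling is not known.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return decomposed.encode("ascii", "ignore").decode("ascii")


def balanced_translation(term, term_list):
    """Return exactly two English terms for one source-language term."""
    translations = term_list.get(term, [])[:2]
    if not translations:
        return [term, term]        # untranslated term generated twice
    if len(translations) == 1:
        return translations * 2    # single translation generated twice
    return translations            # the two best translations, once each


toy_list = {"maison": ["house"], "temps": ["time", "weather", "tense"]}
```

Because every source term contributes exactly two tokens, a term with many dictionary translations carries no more weight in the translated document than a term with one.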



[Figure 1 diagram not reproduced; recoverable labels: Documents, Bilingual Term Lists, Ranked List.]

Fig. 1. Production of a merged multilingual ranked list using document translation.



Each of the four resulting English collections (the fourth consisting of Los 
Angeles Times articles, which did not require translation) was then indexed 
using Inquery (version 3.1pl), with Inquery’s kstem stemmer and default English 
stopword list selected. Queries were produced by enclosing each word in the title, 
description, and narrative fields (except for stop-structure) in Inquery’s ^sum 
operator. In our official runs, two types of stop-structure were removed by hand: 
“find documents” was removed at the beginning of any description field in which 
it appeared, and “relevant documents report” was removed at the beginning of 

^ Clitic contractions are not common in German, so we did not run the splitting 
process in that case. 



CLEF Experiments at Maryland 179 



any narrative field in which it appeared. Because this stop structure was removed 
manually after examining the queries, our runs are officially classified as being in 
the “manual” category. We generated separate ranked lists for each collection and 
then used the weighted round-robin merging technique that we had developed 
for the TREC CLIR track to construct a single ranked list of the top 1000 
documents retrieved for each query [7]. We expected our (monolingual) English 
system to outperform our French and German systems, and we expected our 
Italian system to be adversely affected by the small size of the bilingual term 
list for that language pair. We thus chose a 10:5:5:3 ratio as the relative weights 
for each language. 
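The merging procedure itself is described in [7] rather than here, so the following is only a sketch of one plausible weighted interleaving, not the authors' implementation. The function name, the credit-based scheduling, and the toy document identifiers are all assumptions made for illustration.

```python
def weighted_round_robin(ranked_lists, weights, limit=1000):
    """Interleave per-collection ranked lists in proportion to their weights.

    ranked_lists: dict mapping a collection name to its ranked document ids.
    weights: dict mapping the same names to positive integer weights.
    """
    credit = {name: 0 for name in ranked_lists}
    position = {name: 0 for name in ranked_lists}
    merged = []
    while len(merged) < limit:
        # Only collections that still have unmerged documents take part.
        active = [n for n in ranked_lists if position[n] < len(ranked_lists[n])]
        if not active:
            break
        for name in active:
            credit[name] += weights[name]
        # The collection with the most accumulated credit contributes next.
        pick = max(active, key=lambda n: credit[n])
        credit[pick] -= sum(weights[n] for n in active)
        merged.append(ranked_lists[pick][position[pick]])
        position[pick] += 1
    return merged
```

With the weights quoted above, a call would pass something like `{"English": 10, "French": 5, "German": 5, "Italian": 3}` as `weights`.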

We used the same bilingual term lists for CLEF 2000 that we had employed 
in the TREC-8 CLIR track [7]. Table 1 shows the source and summary statistics
for each dictionary. Source language terms in the bilingual term lists were
normalized in a manner similar to that used for the documents, although clitic 
contractions were not split because they were not common in the bilingual term 
lists. Balanced document translation becomes unwieldy beyond two translations, 
so the number of translations for any term was limited to the two that most com- 
monly occurred in written English. All single word translations were ordered by 
decreasing unigram frequency in the Brown corpus (which contains many gen- 
res of written English), followed by all multi-word translations (in no particular 
order), and finally by any single word entries that did not appear at all in the 
Brown corpus. Translations beyond the second for any English term were then
deleted; this minimized the effect of infrequent words in non-standard usages
or misspellings that might appear in the bilingual term list.
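The ordering and truncation of translations can be sketched as follows. The frequency table stands in for Brown corpus unigram counts; its values in any example call are hypothetical, and the function name is ours.

```python
def order_and_truncate(translations, brown_freq, keep=2):
    """Order candidate translations and keep the first `keep` of them.

    Order: single words by decreasing corpus frequency, then multi-word
    translations (original order), then single words absent from the corpus.
    """
    in_corpus = [t for t in translations
                 if " " not in t and brown_freq.get(t, 0) > 0]
    multiword = [t for t in translations if " " in t]
    unseen = [t for t in translations
              if " " not in t and brown_freq.get(t, 0) == 0]
    in_corpus.sort(key=lambda t: brown_freq[t], reverse=True)
    return (in_corpus + multiword + unseen)[:keep]
```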



Pair   Source                    English Terms   non-English Terms   Avg Translations
E-G    http://www.quickdic.de           99,357             131,273                1.7
E-F    http://www.freedict.com          20,100              35,008                1.3
E-I    http://www.freedict.com          13,400              17,313                1.3

Table 1. Sources and summary statistics for bilingual dictionaries.



2.1 Four-Stage Backoff Translation 

The coverage problem in CLIR arises when the object being translated (in this 
case, a document), contains a term that is not known to the translation resource 
(in this case, the bilingual term list). Bilingual term lists found on the web 
often contain an eclectic mix of root forms and their morphological variants, 
and our experience with the TREC-8 CLIR track suggested that morphological 
analysis of terms contained in documents and bilingual term lists could discover 
plausible translations when no exact match is found. We thus developed a
four-stage backoff strategy that was designed to maximize coverage while limiting the
introduction of spurious translations:

1. Match the surface form of a document term to surface forms of source 
language terms in the bilingual term list. 

2. Match the morphological root of a document term to surface forms of 
source language terms in the bilingual term list. 

3. Match the surface form of a document term to morphological roots of 
source language terms in the bilingual term list. 

4. Match the morphological root of a document term to morphological 
roots of source language terms in the bilingual term list. 

The process terminates as soon as a match is found at any stage, and the known 
translations for that match are generated. Although this process may result 
in generation of an inappropriate morphological variant for a correct English 
translation, the use of English stemming in Inquery should minimize the effect 
of that factor on retrieval effectiveness. 
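The four stages listed above can be sketched in a few lines. This is an illustration under stated assumptions: the term list is a plain dictionary, the function names are ours, and `toy_stem` is a deliberately crude stand-in for whatever morphological analyzer or stemmer is plugged in.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: strips one of a few endings."""
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def build_root_index(term_list, stem):
    """Map the root of each source-language entry to its translations."""
    index = {}
    for source, translations in term_list.items():
        index.setdefault(stem(source), translations)
    return index

def backoff_translate(term, term_list, root_index, stem):
    """Return (translations, stage); stage is None when nothing matches."""
    root = stem(term)
    stages = [
        (1, term, term_list),   # surface form vs. surface forms
        (2, root, term_list),   # root vs. surface forms
        (3, term, root_index),  # surface form vs. roots
        (4, root, root_index),  # root vs. roots
    ]
    for stage, key, table in stages:
        if key in table:
            # Stop at the first stage that yields a match.
            return table[key], stage
    return [], None
```

Building the root index once per term list keeps stages 3 and 4 as cheap as the exact-match stages.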



2.2 Statistical Stemming 

The four-stage backoff strategy described above poses two key challenges. First, 
it would require that an efficient morphological analysis system be available for 
every document language that must be processed. And second, the morphologi- 
cal analysis systems would need to produce accurate results on words presented 
out of context, as they are in the bilingual term list. This is a tall order, so we 
elected to explore a simplification of this idea in which morphological analysis 
was replaced by stemming. Stemmers are freely available for French and German,²
and stemming has proven to be about as effective as more sophisticated
morphology in information retrieval applications where (as is the case in our
application) matching is the principal objective. In TREC-3, Buckley et al.
demonstrated that a simple stemmer could be easily constructed for Spanish
without knowledge of the language by examining lexicographically similar words
to discover common suffixes. We decided to try to push that idea further,
automating the process so that it could be applied to new languages without ad- 
ditional effort. We call this approach “statistical stemming,” since the stemmer 
is learned from the statistics of a text collection, in our case the collection that 
was ultimately to be searched. 

Statistical stemming is a special case of unsupervised acquisition of morphol- 
ogy, a specialized topic in computational linguistics. Of this work, the closest 
in spirit to our objectives that we know of is a program known as Linguistica.
Linguistica examines each token in a collection, observing the frequency

² French and German stemmers are available as part of the PRISE information
retrieval system, which is freely available from the U.S. National Institute
of Standards and Technology. Stemmers for a broader collection of languages,
including Italian, are also available from the Muscat project at
http://open.muscat.com/developer/index.html
of stems and suffixes that would result from every possible breakpoint. An
optimal breakpoint for each token is then selected by applying as a constraint that
every instance of a token must have the same breakpoint and then choosing 
breakpoints for each unique token that minimize the number of bits needed to 
encode the collection. This “minimum description length” criterion captures the 
intuition that breakpoints should be chosen in such a way that each token is 
partitioned into a relatively common stem and a relatively common suffix.
Linguistica is freely available,³ but the implementation we used could process only
about 200,000 words on a 128 MB Windows NT machine. This is certainly large 
enough to ensure that breakpoints will be discovered for most common words, 
but breakpoints might not be discovered for less common terms — quite possibly 
the terms that would prove most useful in a search. We therefore augmented 
Linguistica with a simple rule induction technique to handle words that were 
outside Linguistica’s training set. 

We implemented rule induction as follows. We first counted the frequency of 
every one, two, three and four-character suffix that would result in a stem of three 
or more characters for the first 500,000 words of the collection. Each instance 
of every word was used to compute the suffix frequencies. These statistics alone 
would overstate the frequency of partial suffixes — for example, “-ng” is a common 
ending in English, but in almost every case it is part of “-ing”. We thus subtracted
the frequency of the most common subsuming suffix of the next longer length
from each suffix.⁴ The adjusted frequencies were then used to sort all two, three
and four-character suffixes in decreasing order of frequency. We observed that 
the count vs. rank plot for an English training case was convex, so we selected the 
rank at which the second derivative of the count vs. rank plot was maximized as 
the limit for how many suffixes to generate for each length. In tuning experiments 
with English, this approach did not work well for single-character suffixes because 
the distribution of character frequency (regardless of location) is highly skewed. 
We thus sorted single characters by the ratio between their word-final likelihood 
and their unconditioned likelihood, and again used the maximum of the second 
derivative as a stopping point.⁵ For each word, the first matching suffix (if any,
from the top of the list) was then removed to produce the stemmed form. 
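The rule induction procedure described above can be approximated by the following sketch. It is not the authors' software: the function names are ours, the special handling of single-character suffixes is omitted, and two details are assumptions, namely that the suffixes ranked before the point of maximum second derivative are kept and that longer suffixes are tried before shorter ones.

```python
from collections import Counter

def suffix_counts(tokens, max_len=4, min_stem=3):
    """Count 1- to max_len-character suffixes over every token instance."""
    counts = Counter()
    for token in tokens:
        for n in range(1, max_len + 1):
            if len(token) - n >= min_stem:
                counts[token[-n:]] += 1
    return counts

def adjust_counts(counts, max_len=4):
    """Subtract the count of the most common subsuming suffix one character longer."""
    adjusted = {}
    for suffix, count in counts.items():
        if len(suffix) < max_len:
            longer = [c for s, c in counts.items()
                      if len(s) == len(suffix) + 1 and s.endswith(suffix)]
            count -= max(longer, default=0)
        adjusted[suffix] = count
    return adjusted

def knee(sorted_counts):
    """Number of leading entries to keep, from the largest discrete second derivative."""
    if len(sorted_counts) < 3:
        return len(sorted_counts)
    second = [sorted_counts[i - 1] - 2 * sorted_counts[i] + sorted_counts[i + 1]
              for i in range(1, len(sorted_counts) - 1)]
    return second.index(max(second)) + 1

def induce_rules(tokens, lengths=(4, 3, 2)):
    """Build an ordered suffix list, longest suffixes first (an assumption)."""
    adjusted = adjust_counts(suffix_counts(tokens))
    rules = []
    for n in lengths:
        ranked = sorted((s for s in adjusted if len(s) == n),
                        key=lambda s: adjusted[s], reverse=True)
        rules.extend(ranked[:knee([adjusted[s] for s in ranked])])
    return rules

def stem(word, rules, min_stem=3):
    """Remove the first matching suffix that leaves a stem of min_stem characters."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word
```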

The heuristics we chose were motivated by our intuition of what constituted 
a likely suffix, but the details were settled only after a good deal of tweaking 
with a training collection. Of note, the training collection contained only English 
documents and the tweaking was done by the first author, who has no useful 
knowledge of French, German or Italian. Table 2 shows the suffix removal rules
for those languages that were automatically produced with no further tuning. 
Many of the postulated suffixes in that table accord well with our intuition, as in 



³ Linguistica is available at
http://humanities.uchicago.edu/faculty/goldsmith/index.html

⁴ We did not adjust the frequencies of four-character suffixes since we did not count
the five-character suffixes.

⁵ If a more precise specification of the process is desired, the source code for the rule
induction software is available from the first author.
the case of the French adverbial suffix -ment or the third-person plural inflectional
suffix -ent. However, some others suggest insufficient generalization. Consider the
suggested German suffixes: -ngen, -nden, -sen, -nen, -gen, -den, and -ten. The more
appropriate suffix would be -en; however, the preference for longer subsuming
strings selects the less general suffixes. A large number of single-character
suffixes are suggested for Italian, including letters such as -k and -w, which do not
typically appear in word-final position in this language. This somewhat
counterintuitive set suggests that further optimization of threshold setting may be
needed.



French   German   Italian
ment     chen     ione
tion     ngen     ente
ique     nden     ioni
ions     sche     ento
ent      rung     enti
res      lich     ato
tes      sten     are
es       ten      to
re       ung      ta
x        den      re
s        gen      ti
         nen      no
         ter      la
         sen      y
         en       o
         er       e
         te       a
         y        k
         t        i
                  x
                  w

Table 2. Candidate stems, in order of removal.



Three official runs were submitted. In our baseline run (“unstemmed”), we
used no pre-translation stemming (i.e., stage one alone). In our Linguistica run
(“backoff4Ling”), we implemented the complete four-stage backoff strategy using
Linguistica for terms with known breakpoints, and added a fifth stage, invoked
only if none of the first four stages found a translation, that replicated
stage four using the rule induction stemmer in place of Linguistica. The
rule induction process was considerably faster than Linguistica (less than 5
minutes, compared with 30-40 minutes for Linguistica), so we also submitted a third
run (“backoff4”) in which we implemented four-stage backoff with rule induction alone.
Table 3 summarizes these conditions.








        unstemmed              backoff4Ling                      backoff4
Stage   Document   Term List   Document         Term List        Document         Term List
1       None       None        None             None             None             None
2                              Linguistica      None             Rule Induction   None
3                              None             Linguistica      None             Rule Induction
4                              Linguistica      Linguistica      Rule Induction   Rule Induction
5                              Rule Induction   Rule Induction

Table 3. Backoff translation steps for the three official runs.



3 Results 

Our backoff4 run was judged, and all three runs were scored officially. The top
line in Table 4 summarizes the results. Overall, a four-stage backoff document
translation strategy using statistical stemming achieved an improvement in
retrieval effectiveness over the unstemmed approach that was found to be
statistically significant by a paired two-tailed t-test (p < 0.05 in both cases). Figure 2
illustrates the advantage of backoff translation on a topic-by-topic basis.
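For readers who want to reproduce this kind of comparison, the paired t statistic over per-topic average precision can be computed as below. This is a generic sketch using only the standard library; the function name is ours and any scores passed to it in an example are hypothetical. With 40 topics there are 39 degrees of freedom, for which the two-tailed critical value at p = 0.05 is roughly 2.02.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-topic scores.

    Assumes at least two topics and that the differences are not all equal.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

The difference is judged significant when the absolute value of t exceeds the critical value for the given degrees of freedom.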





               Unstemmed   Backoff4   Backoff4Ling
Multilingual      0.1798     0.1952         0.1938
English           0.4348     0.4348         0.4348
French            0.1877     0.2823         0.2649
German            0.2421     0.2421         0.2425
Italian           0.2127     0.2045         0.2022

Table 4. Multilingual and language-specific mean uninterpolated average
precision, averaged over 40 topics.
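The measure reported in Table 4 can be computed as in the following sketch, which uses the standard definition of uninterpolated average precision. The function names and document identifiers are illustrative only.

```python
def average_precision(ranked_docs, relevant):
    """Uninterpolated average precision for one topic.

    Precision is taken at the rank of each relevant document retrieved and
    averaged over all relevant documents, retrieved or not.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """Mean over topics; `topics` is a list of (ranked_docs, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)
```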



Surprisingly, our simple (and quite ad hoc) rule induction technique produced
results that were statistically indistinguishable from those obtained using
the more sophisticated Linguistica system. As Figure 3 shows, Linguistica does
better on some topics, but worse on others.

As Figure 4 shows, on balance backoff translation with statistical stemming
performed somewhat better than the median of the submitted CLEF multilingual
runs in the automatic category. Since the effect of our limited manual
stop-structure removal was likely quite small, we interpret these results as
indicating that we have achieved a credible degree of retrieval effectiveness using
only freely available linguistic resources.

Although we can conclude from these results that four-stage backoff resulted 
in improved retrieval effectiveness and that statistical stemming appears to be 
a viable substitute for more sophisticated morphological analysis in this appli- 
cation, the multilingual task design can easily mask single-language effects. We 



[Figures: three plots, each with an axis labeled “Difference in Average Precision”.]

Fig. 3. Improvement (above axis) of Backoff4 over Backoff4Ling.

Fig. 4. Comparison of Backoff4 (better above axis) with median CLEF results.
therefore performed a post hoc language-specific analysis by segregating the
selected documents and the relevance judgments by language and then scoring
each ranked list against the appropriate relevance judgments. The ranked lists
that we scored thus contained 130 (for Italian) or 217 (for French and German)
documents from each language. The resulting mean uninterpolated average
precision values are shown in the lower portion of Table 4. For French, we found
that both implementations of backoff translation achieved a 55% relative
improvement over the unstemmed case, and we found that result to be
statistically significant by a paired two-tailed t-test at p < 0.05. As Figure 5 illustrates,
the magnitude of the improvement varies somewhat across topics, although some
of the observed variation may be due to differences in the number of relevant
documents for each topic.
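The list sizes of 130 and 217 are consistent with a 1000-document merged list divided in proportion to the 10:5:5:3 merging weights. That reading is our inference, not something stated explicitly in the text, and the snippet below only checks the arithmetic.

```python
# Merging weights quoted earlier in the paper.
WEIGHTS = {"English": 10, "French": 5, "German": 5, "Italian": 3}

def expected_shares(weights, list_length=1000):
    """Round each collection's proportional share of the merged list."""
    total = sum(weights.values())
    return {name: round(list_length * w / total) for name, w in weights.items()}
```

For example, the French share is 1000 x 5 / 23, or about 217.4, and the Italian share is 1000 x 3 / 23, or about 130.4.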




[Figure: plot with an axis labeled “Topic”.]

Fig. 5. Improvement (above axis) of backoff translation over unstemmed
translation for French documents.



As Table 4 shows, no similar beneficial effect was observed from backoff
translation in German or Italian. Many CLEF participants observed that it was
important to split German compounds, something that we did not do. Further 
analysis of the German results may thus not be productive until we have given 
some thought to how backoff translation and statistical stemming might be inte- 
grated with automatic compound splitting. Our disappointing results in Italian 
might be explained by two possible causes. One possibility is that our statistical 
stemming techniques are not well suited to some characteristic of Italian. The 
alternative hypothesis is that our Italian-English bilingual term list (the smallest 
of the three that we used) may simply be too small. 

To explore this issue further, we conducted an additional set of post hoc
experiments in which we substituted a freely available manually constructed
rule-based Italian stemmer from the Muscat project⁶ for the Italian rule induction
statistical stemmer. We call the new runs “Backoff4Muscat.” As Table 5
shows, Backoff4Muscat outperforms Backoff4, although the improvement over
Backoff4 is not statistically significant in either the multilingual or the
Italian-specific case. The improvement of Backoff4Muscat over Unstemmed was found
to be significant at p < 0.05 by a two-tailed paired t-test in the multilingual
case, but not for Italian alone. We thus conclude that four-stage backoff
translation is helpful even with relatively small bilingual term lists, and that the poor
performance of Backoff4 for Italian results from a deficiency in the statistical
stemmers that we used.

⁶ http://open.muscat.com/developer/index.html

Douglas W. Oard, Gina-Anne Levow, and Clara I. Cabezas





               Unstemmed   Backoff4   Backoff4Muscat
Multilingual      0.1798     0.1952           0.1994
Italian           0.2127     0.2045           0.2338

Table 5. Comparison of backoff translation using statistical stemming and the
Muscat Italian stemmer.



4 Conclusion 

We have introduced two new techniques, four-stage backoff translation and sta- 
tistical stemming, and shown how they can be used together to improve retrieval 
effectiveness in a document translation architecture. Four-stage backoff transla- 
tion appears to help when using impoverished lexicons that contain a few tens of 
thousands of terms. Our initial experiments with statistical stemming produced 
promising results in French, but it is clear from our German and Italian results 
that more work is required before similar techniques can be reliably applied to 
a broader range of languages. 

Our experiments suggest a number of promising directions for future work. 
A new version of Linguistica is now available, and trying that is an obvious first 
step. A detailed analysis of the threshold selection in Italian for our rule induc- 
tion statistical stemmer is also clearly called for — a more appropriate threshold 
selection technique might produce far better results with little effort. Our orig- 
inal threshold selection strategy was chosen after inspection of English training 
data, but an alternative would be to learn a set of language-specific thresholds 
using test collections from the Text Retrieval Conference’s Cross-Language In- 
formation Retrieval track. Extending that line of reasoning, the correct answer 
might not be a single set of thresholds but rather an iterative technique that 
takes advantage of our ranking of possible stems. We might, for example, first 
stem using a conservative set of thresholds, and then restem more aggressively if 
no match is found. Finally, it might be productive to explore the middle ground 
between our simple rule induction stemmer and the full Linguistica system in 
order to learn which techniques are particularly helpful in this application.
Linguistica provides for fine-grained control over its operation, and exploring a range
of possible parameter settings would be a first step in this direction. 

When coupled with other language-independent techniques such as blind
relevance feedback for query expansion and for post-translation document
expansion, the techniques that we have explored in this work can potentially provide
developers with a robust toolkit with which to design effective dictionary-based
CLIR systems using only a bilingual term list and some modest query-language
resources (specifically, a comparable collection from which to obtain term
statistics). This first CLEF evaluation has proven to be a suitable venue for exploring
these questions, and we look forward to continued participation in future years. 
Abstract. In this paper, we describe our approach in CLEF Cross-Language IR
(CLIR) tasks. In our experiments, we used statistical translation models for 
query translation. Some of the models are trained on parallel web pages that are 
automatically mined from the Web. Others are trained from bilingual 
dictionaries and lexical databases. These models are combined in query 
translation. Our goal in this series of experiments is to test if the parallel web 
pages can be used effectively to translate queries in multilingual IR. In 
particular, we compare models trained on Web documents with models that also 
combine other resources such as dictionaries. Our results show that the models 
trained on the parallel web pages can achieve reasonable CLIR performance. 
However, combining models effectively is a difficult task, and single models 
still yield better results. 



1 Introduction 

In Cross-Language Information Retrieval (CLIR), the usual approach is to translate 
queries to the target language of the documents. One of the ways to perform query 
translation is to use a large set of parallel texts to train a statistical translation model. 
This approach has been successfully applied in previous CLIR experiments [4]. 
However, a possible obstacle is the lack of parallel texts for many language pairs. In 
order to overcome this obstacle, we conducted a research project to try to find parallel 
web pages automatically. In the past two years, we were able to build models for 
French-English and Chinese-English translations. Our results showed comparable 
performance to MT systems. 

This year, we successfully mined several sets of parallel Web pages for the
following language pairs: English-Italian and English-German, in addition to the
English-French corpus we found previously. Our goal in this year's CLEF experiments is to
see if the parallel Web documents can also be applied to multilingual IR.

In our previous experiments, we observed that a certain combination of the 
translation models with a dictionary could improve IR effectiveness. However, the 
combination remained ad hoc: a dictionary translation is attributed a certain "default 
probability" and combined with translation words provided by a statistical translation 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 188-201, 2001. 
model. In CLEF experiments, we tested a new combination method. First, a
dictionary is transformed into a statistical translation model. This is done by 
considering a word/term and its translation words/terms as two parallel texts. Then 
different statistical translation models are combined linearly. The parameters of the 
combination are set so as to maximize the translation probability of held-out data. 

In this paper, we will first describe the mining system we used to gather parallel 
texts from the Web. Then a brief description of the training process of the statistical 
model will be provided. The CLEF experimental results will be reported. We provide 
some analysis of the translation process before the concluding remarks. 



2 Mining Parallel Texts from the Web 

Statistical models have often been used in computational linguistics for building MT 
systems or constructing translation assistance tools. The problem we often have is the 
unavailability of parallel texts for many language pairs. The Hansard corpus is one of 
the few existing corpora for English and French. For other language pairs, such
corpora are scarce or unavailable. In order to solve this problem, we conducted a
text-mining project on the Web in order to find parallel texts automatically. The first
experiments with the mined documents have been described in [5]. Although these
experiments used only a subset (5000) of the mined documents, we obtained
reasonably high CLIR performance. This experiment showed the feasibility of the
approach based on parallel web pages. Later on, we trained another translation model 
with all the Web documents found, and the CLIR effectiveness obtained is close to 
that with a good MT system (Systran). 

The mining process proceeds in three steps: 

1. selection of candidate Web sites

2. finding all the documents from the candidate sites 

3. pairing the texts using simple heuristic criteria 

The first step aims to determine the possible web sites where there may be parallel 
texts for the given language pair. The way we did this is to send requests to some 
search engines, asking for French documents containing an anchor text such as 
"English version", "english", and so on; and similarly for English documents. The 
idea is, if a French document contains such an anchor text, the link to which the 
anchor is associated usually points to the parallel text in English (fig. 1). 



[Figure: two panels labeled “Page in French” and “Page in English”.]

Fig. 1. Detection of candidate web sites







Jian-Yun Nie, Michel Simard, and George Foster 



From the set of documents returned by the search engines, we extract the addresses 
of web sites, which are considered as candidate sites. 

The second step also uses the search engines. In this step, a series of requests are 
sent to the search engines to obtain the URLs of all the documents in each site. In 
addition, as search engines only index a subset of all the web pages at each site, a host 
crawler is used to explore each candidate site more completely. This crawler follows 
the links in each web page. If a link points to another web page on the same site, then 
this page is added to the collection of web pages. In this way, many more web pages 
have been found. 
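The host crawler described above can be sketched as a breadth-first traversal restricted to one host. This is an illustration, not PTMiner itself: `fetch_links` is a hypothetical callable supplied by the caller that returns the link targets found on a page, and `max_pages` is an assumed safety limit.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_site(start_url, fetch_links, max_pages=1000):
    """Collect URLs reachable from start_url that stay on the same host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for href in fetch_links(page):
            # Resolve relative links against the page they appear on.
            url = urljoin(page, href)
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
                if len(seen) >= max_pages:
                    return seen
    return seen
```

Passing the link extractor in as a parameter keeps the traversal logic independent of how pages are downloaded and parsed.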

The last step consists of pairing up the URLs. We used some heuristic rules to 
determine quickly whether a URL may be parallel to another:

1. First, parallel texts usually have similar URLs. The only difference between them
is often a segment denoting the language of the document. For example, "-en",
"e", and so on for English documents. The corresponding segments for French
include "-fr". Examples of parallel URLs are shown below:

Table 1. Examples of parallel URLs

French page                              English page
www.booksatoz.com/french/Museumf.htm     www.booksatoz.com/Museum.htm
www.c3ed.uvsq.fr/esee/french/state.htm   www.c3ed.uvsq.fr/esee/english/state.htm
www.gov.nh.ca/dot/adm/adminf.htm         www.gov.nh.ca/dot/adm/admin.htm
www.psac.com/comp/upwe/upwe-f.htm        www.psac.com/comp/upwe/upwe-e.htm
www.psac.com/comm/news/9602007f.htm      www.psac.com/comm/news/9602007e.htm

Therefore, by examining the URLs of the documents, we can quickly determine
which files may be a pair.

2. We then use other criteria such as the length of the file to further confirm or reject 
a pair. 

3. The above criteria do not require downloading the files. Once a set of possible 
pairs is determined, the paired files are downloaded. Then we can perform some 
checking of the document contents. For example, are their HTML structures 
similar? Do they contain enough text? Can we align them into parallel sentences? 
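The first two heuristics above can be sketched as follows. The substitution patterns are hypothetical and loosely modeled on the examples in Table 1; the actual rules used by PTMiner are not given in the text. The length tolerance is likewise an assumed value.

```python
# Hypothetical (French marker, English marker) pairs, for illustration only.
SEGMENT_PAIRS = [("/french/", "/english/"), ("-f.", "-e."), ("f.htm", "e.htm")]

def candidate_english_urls(french_url):
    """Generate English URLs by swapping one language segment."""
    return [french_url.replace(fr, en, 1)
            for fr, en in SEGMENT_PAIRS if fr in french_url]

def pair_urls(french_urls, english_urls):
    """Pair each French URL with the first candidate found among English URLs."""
    english = set(english_urls)
    pairs = []
    for fr in french_urls:
        for candidate in candidate_english_urls(fr):
            if candidate in english:
                pairs.append((fr, candidate))
                break
    return pairs

def similar_length(size_a, size_b, max_ratio=1.5):
    """Second heuristic: accept a pair only if file sizes are comparable."""
    if min(size_a, size_b) <= 0:
        return False
    return max(size_a, size_b) / min(size_a, size_b) <= max_ratio
```

Both checks work on URLs and reported file sizes alone, so no page needs to be downloaded until a candidate pair has passed them.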

The French-English parallel corpus was constructed last year at the RALI 
laboratory. This year, we cooperated with Twenty-One (W. Kraaij) to construct 
English-Italian and English-German parallel corpora, using the same mining system - 
PTMiner [2]. The following table shows the number of text pairs as well as volume of 
the corpora for different language pairs. 



Table 2. Training corpora

              E-F       E-G       E-I
Pairs         18,807    10,200
Volume (Mb)   174 198   77 100
The corpora found from the Web will be called WAC corpora (Web Aligned
Corpora). The models trained with these corpora will be called the WAC models. 



3 Principle of Building a Probabilistic Translation Model 

A set of parallel texts in two languages is first aligned into parallel sentences.
The criteria used in sentence alignment are the position of the sentence in the text 
(parallel sentences have similar positions in two parallel texts), the length of the 
sentence (they are also similar in length), and so on [3]. In [6], it is proposed that 
cognates may be used as an additional criterion. Cognates refer to the words (e.g. 
proper names) or symbols (e.g. numbers) that are identical (or very similar in form) in 
two languages. If two sentences contain such cognates, it provides additional evidence 
that they are parallel. It has been shown that the approach using cognates performs 
better than the one without cognates. Before the training of models, each corpus is 
aligned into parallel sentences using a cognate-based alignment algorithm.

Once a set of parallel sentences is obtained, word translation relations are 
estimated. First, it is assumed that every word in a sentence may be the translation of 
every word in its parallel sentence. Therefore, the more often a pair of words appears 
in parallel sentences, the better its chances of being a valid translation. In this way, we 
obtain the initial probabilities of word translation. 

At the second step, the probabilities are submitted to a process of Expectation 
Maximization (EM) in order to maximize the probabilities with respect to the given 
parallel sentences. The algorithm of EM is described in [1]. The final result is a 
probability function P(f|e), which gives the probability that f is the translation of e.
Using this function, we can determine a set of probable word translations in the target 
language for each source word, or for a complete query in the source language. 
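The estimation procedure outlined above resembles the simplest word-based translation model trained with EM. The sketch below assumes that style of model (every source word in a sentence may generate every target word, with no empty word and a uniform starting point); the text does not confirm these details, and the function name is ours.

```python
from collections import defaultdict

def train_translation_model(sentence_pairs, iterations=10):
    """Estimate t[(f, e)] = P(f|e) from aligned (e_sentence, f_sentence) token lists."""
    # Start uniform over the target words that co-occur with each source word.
    cooccur = defaultdict(set)
    for e_sent, f_sent in sentence_pairs:
        for e in e_sent:
            cooccur[e].update(f_sent)
    t = {(f, e): 1.0 / len(fs) for e, fs in cooccur.items() for f in fs}

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for e_sent, f_sent in sentence_pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts for each source word.
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
    return t
```

Sorting the entries for a given source word e by probability yields the set of probable translations mentioned in the text.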



4 The Training of Multiple Models and Their Combination 

For English and French, we also have other resources: the Hansard corpus (a set of 
parallel French and English texts from the Canadian parliament debates), a large 
terminology database (Termium) and a small bilingual dictionary (Ergane). A 
translation model is trained from the Hansard data, in the same way as for the Web 
documents (WAC). 

In both the terminology database and the bilingual dictionary, we have English 
words/terms, and their French translations (words/terms). In some way, we can also 
think of these two resources as two sets of special parallel "sentences". Therefore, the 
translation probability between words can also be estimated with the same statistical 
training process. In this way, two additional translation models are estimated from 
them. In total, we obtain 4 different translation models between English and French 
from four different resources (in each direction). The question now is how we can 
combine them in a reasonable way. 

We choose a linear combination of the models. Each model is assigned a 
coefficient denoting our confidence in it. The coefficients are tuned from a set of 
"held-out" data - a set of parallel sentences (about 100K words), by using the EM
algorithm to find values which maximize the probability of this data according to the 
combined model. This set is selected from different resources (distinct from those 
used for model training) so that it gives a good balance of different kinds of texts. 
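Tuning mixture coefficients with EM on held-out data can be sketched as below. This is the textbook procedure for interpolation weights, offered as an illustration rather than the authors' code; each held-out event is represented simply by the probabilities that the component models assign to it, and the function name is ours.

```python
def em_mixture_weights(event_probs, iterations=50):
    """Fit coefficients that maximize the held-out likelihood of a linear mixture.

    event_probs: list of tuples, one per held-out event, giving the
    probability assigned to that event by each component model.
    """
    k = len(event_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in event_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix == 0.0:
                continue  # no model explains this event; it cannot inform the weights
            # E-step: responsibility of each model for this event.
            for i in range(k):
                totals[i] += weights[i] * probs[i] / mix
        norm = sum(totals)
        if norm == 0.0:
            break
        # M-step: new weights are the normalized responsibilities.
        weights = [x / norm for x in totals]
    return weights
```

A model that assigns zero probability to an event receives no responsibility for it, which is why models with small vocabularies tend to end up with small coefficients.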

Finally, the following coefficients are assigned to each model: 



Table 3. Parameters for linear combination of models

Model     Parameter
Ergane    0.041
Hansard   0.301
Termium   0.413
WAC       0.245
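With these coefficients, the combined translation probability is a weighted sum of the component probabilities. The sketch below shows that computation; representing each model as a dictionary keyed by (f, e) pairs is our assumption, and any probabilities supplied in an example are hypothetical.

```python
# Coefficients from Table 3.
COEFFICIENTS = {"Ergane": 0.041, "Hansard": 0.301, "Termium": 0.413, "WAC": 0.245}

def combined_probability(f, e, models, coefficients=COEFFICIENTS):
    """Linear combination of P(f|e) over the component models.

    models: dict mapping a model name to a dict of {(f, e): probability};
    a pair missing from a model counts as probability zero.
    """
    return sum(coefficients[name] * models[name].get((f, e), 0.0)
               for name in coefficients)
```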



We observe that the combination seems to favor models with larger vocabularies. 
Termium is attributed the highest coefficient because it contains about 1 million 
words/terms in each language. The Hansard corpus and the WAC corpus contain 
about the same volume of texts. So their coefficients are comparable. The Ergane 
dictionary is a small dictionary that only contains 9000 words in each language. Its 
coefficient is very low. The main reason for this is that the EM algorithm penalizes
models which assign zero probabilities to target-text words, and models with small 
vocabularies will assign zero probabilities more often than those with large 
vocabularies. Therefore a larger model will usually be preferred over a smaller model, 
even though the translations it contains may not be as accurate. Although the 
coefficients we used are the best for the held-out data in the sense of maximizing its 
likelihood, they may not be suitable to our data in CLEF, and the maximum- 
likelihood approach may not be ideal in this context 



5 Experiments 

We used a modified version of the SMART system [9] for monolingual document 
indexing and retrieval. The ltn weighting scheme is used for documents. For queries, 
we used the probabilities provided by the probabilistic model, multiplied by the idf 
factor. From the translation words obtained, we retained the top 50 words for each 
query. The value of 50 seemed to be a reasonable number on TREC6 and TREC7 
data. 



5.1 Monolingual IR 

Monolingual IR results have been submitted for the following languages: French, 
Italian and German. This series of experiments uses the SMART ltn weighting 
scheme for queries as well. In addition, pseudo-relevance feedback is applied, which 
uses the 100 most important terms among the top 30 documents retrieved to revise the 
original queries. The parameters used for this process are α=0.75 and β=0.25. The 
results obtained are shown below: 



Table 4. Monolingual IR effectiveness 

                           French    Italian   German 
> medium                   18        18        12 
< medium                   16        16        25 
Av.p. with feedback        0.4026    0.4334    0.2301 
Av.p. without feedback     0.3970    0.4374    0.2221 



The comparisons with medium runs are only done for the submitted runs with 
pseudo-relevance feedback. As we can see, a great difference can be observed in 
effectiveness in the above runs. Several factors have contributed to this. 

1. The use of a stoplist 

In the case of French, a set of stopwords was set up carefully by French-speaking 
people. In the case of Italian and German, we used two stoplists found on the Web 
[7]. In addition, a small set of additional stopwords was added manually for Italian. 

2. The use of a lemmatizer or a stemmer 

For French, we used a lemmatizer developed in the RALI laboratory that first uses 
a statistical tagger, then transforms a word to its citation form according to its part-of- 
speech category. For Italian and German, two simple stemmers obtained from the 
Web [8] are used. There is no particular processing for compound words in German. 
This may be an important factor that affected the effectiveness of German IR. 

Overall, the French and Italian monolingual runs seem to be comparable to the 
medium performance of the participants, but the German run is well below the 
medium performance. We think the main reason is the lack of special 
processing for German (e.g. of compound words). 



5.2 Tests on Bilingual IR 

The bilingual task consists in finding documents in a language different from that of 
the queries. We tested the following bilingual IR: E-F (i.e. English queries for French 
documents), E-I and E-G. For this series of tests, we first used the translation models 
to obtain a set of 50 weighted translation words for each query. Unknown words are 
not translated. They are added into the translation words with a default probability of 
0.05. The same pseudo-relevance feedback process as that in monolingual IR is used. 
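
The query translation step described above can be sketched as follows. This is a simplified illustration rather than the actual implementation: the data layout (a dictionary of per-word translation probabilities), the averaging over the query length, and the choice to append unknown words after the top-50 cut are all assumptions, and the idf factor applied to query weights is omitted.

```python
def translate_query(query_words, translation_model, top_n=50, unknown_prob=0.05):
    """Sketch of word-by-word query translation.

    translation_model: dict source_word -> dict target_word -> P(target | source).
    Returns a list of (word, weight) pairs.
    """
    translated = {}
    unknown = []
    for word in query_words:
        candidates = translation_model.get(word)
        if not candidates:
            # Unknown words are not translated; they are kept as they are
            if word not in unknown:
                unknown.append(word)
            continue
        for target, prob in candidates.items():
            # Assumption: contributions are averaged over the query length
            translated[target] = translated.get(target, 0.0) + prob / len(query_words)
    ranked = sorted(translated.items(), key=lambda item: item[1], reverse=True)[:top_n]
    # Unknown words are added with the default probability
    return ranked + [(word, unknown_prob) for word in unknown]
```

For example, a query containing an untranslatable compound such as "golfkriegssyndrom" would keep that word in the output with weight 0.05.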

Between English and Italian, and between English and German, we only have the Web parallel 
documents to train our translation models. For French and English, we have multiple 
translation resources: the Web documents, the Hansard corpus, and two bilingual 
dictionaries. So we also compare the model with only the Web documents (the WAC 
model) and the model with all the resources combined (the Mixed model). The 
following table summarizes the results we obtained for bilingual IR. Only 33 queries 
have relevant documents, and are considered in these evaluations. 

Table 5. Bilingual IR with different models 

                           F-E (WAC)   F-E (Mixed)   I-E (WAC)   G-E (WAC) 
> medium                   20          16            21          13 
< medium                   13          17            13          21 
Av.p. with feedback        0.2197      0.1722        0.2032      0.1437 
Av.p. without feedback     0.2410      0.1728        0.2102      0.1456 



The runs we submitted are those with pseudo-relevance feedback. These runs are 
compared with medium runs in the above table. For the F-E and I-E cases, the WAC 
models perform better than the medium. The Mixed model of F-E gives a medium 
performance. The comparison between the two translation models for French to 
English is particularly interesting. We expected that the Mixed model would perform 
better because it is trained with more data from different sources. Surprisingly, its 
effectiveness is worse than that of the WAC model alone. We see two possible explanations 
for this: 

- The combination of different resources is tailored for a set of held-out data that 
does not come from the CLEF document set. So there may be a bias in the 
combination. 

- During the combination, we observed that the combination results tend to favor 
dictionary translations. A high priority is attributed to dictionary translations. This 
may also be attributed to the biased tuning of combination. 

In Table 2, we showed that the I-E training corpus is smaller than both F-E and G- 
E corpora. However, the model trained with it seems to be better suited to our CLIR 
task than the G-E model. This may be due to two possible reasons. 

1. The quality of the translation model is determined by not only the size of the 
training corpus, but also the correspondence of the training data to the application 
corpus. 

2. The quality of the model is dependent on the languages and on the processing on 
them. 

In our case, the processing of German is the weakest. In particular, we did not 
consider compound words in German. This may have had a great impact on the 
trained model. It is also in translating German queries that we encountered the most 
unknown words, as we can see in Table 6. Quite a number of them are compound 
words such as "welthandelsorganisation", "elektroschwachtheorie" and 
"golfkriegssyndrom". 



Table 6. Number of unknown words encountered by WAC models 

Model            F-E    I-E    G-E 
Unknown words    67     30     128 

Another observation of Table 5 is that the pseudo-relevance feedback we used led 
to a general decrease in effectiveness, especially in the case of the WAC model for F- 
E. This may be due to the fact that the initial retrieval effectiveness is too low or the 
setting of the feedback parameters is not suitable. 
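
For reference, the feedback step of Section 5.1 (100 terms from the top 30 documents, α=0.75, β=0.25) can be sketched in the Rocchio style. The exact formula used inside the modified SMART system is not given in the text, so the centroid-based update below is an assumption, as are the function name and the dictionary-based vectors.

```python
def pseudo_relevance_feedback(query, ranked_docs, alpha=0.75, beta=0.25,
                              top_docs=30, top_terms=100):
    """Rocchio-style query revision (illustrative sketch).

    query: dict term -> weight.
    ranked_docs: list of dicts term -> weight, best document first.
    """
    feedback_docs = ranked_docs[:top_docs]
    if not feedback_docs:
        return dict(query)
    # Centroid of the documents assumed to be relevant
    centroid = {}
    for doc in feedback_docs:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(feedback_docs)
    # Keep only the most important centroid terms for the revision
    best = sorted(centroid.items(), key=lambda item: item[1], reverse=True)[:top_terms]
    revised = {term: alpha * weight for term, weight in query.items()}
    for term, weight in best:
        revised[term] = revised.get(term, 0.0) + beta * weight
    return revised
```

If the top-ranked documents are mostly non-relevant, as may happen when the initial retrieval is weak, the centroid pulls the query away from the topic, which is consistent with the decrease observed above.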

In the case of F-E CLIR, we tested several models separately. The following table 
shows the effectiveness between French and English using different translation 
models. It also compares the effectiveness with and without pseudo-relevance 
feedback. 



Table 7. Comparison of bilingual IR with different individual models 

Model                      WAC       Hansard   Termium   Mixed 
Av.p. without feedback     0.2410    0.2869    0.2182    0.1728 
Av.p. with feedback        0.2197    0.2914    0.2359    0.1722 



We can see that the mixed model performed worse than any of the individual 
models. This indicates clearly that the combination of the models is not suitable for 
the CLEF data. Again, the pseudo-relevance feedback did not have a uniform impact 
on effectiveness. In the case of the mixed model, the impact is almost null. In the 
Hansard and Termium models, the impacts are positive, whereas in the WAC model, 
it is negative. 

This table clearly shows that the effectiveness in the official runs could be 
improved greatly by 1) a better relevance feedback process (or by removing this 
process), and 2) a better combination of models. 



5.3 Multilingual Runs 

In our case, the multilingual runs are only possible from English to all the languages 
(English, French, Italian and German). In these experiments, we followed these three 
steps: 

1. Translate English queries to French, Italian and German, respectively; 

2. Retrieve documents from the different document sets; 

3. Merge the retrieval results. 

The translation of English queries to German and Italian was done by the WAC 
translation model (trained from the Web documents). For English to French, we also 
have the alternative of using the Mixed model. The translation words are submitted to 
the mtc transformation of SMART. This scheme is chosen because it leads to 
comparable similarity values between results from different data sets and therefore 
makes the result merging easier. The merging is done according to the similarity 
scores. The top 1000 retrieved documents are selected as the final results and submitted for 
evaluation. 

The following table describes the results of different runs. In the WAC column, all 
the models used to translate English queries are WAC models. In the Mixed case, 
only the English to French translation uses the Mixed model, whereas the other 
translations still use the WAC models. 

Table 8. Multilingual IR effectiveness 

                           WAC       Mixed 
> medium                   14        12 
< medium                   26        28 
Av.p. with feedback        0.1531    0.1293 
Av.p. without feedback     0.1548    0.1544 



As we can see, these performances are all below the medium performance. One of 
the main reasons may be that the German monolingual retrieval does not use any 
linguistic preprocessing, and has a very poor effectiveness. This may greatly affect 
the multilingual runs. Another possible reason is the over-simplified merging method 
we used. In fact, in order to render the English monolingual runs compatible (in terms 
of similarity values) with the other bilingual runs, we had to choose the mtc weighting 
scheme as for the other cases. In our tests, we observe that this weighting scheme is 
not as good as ltc. Therefore, the ease of result merging has been obtained to the 
detriment of English effectiveness. 

We observe again the negative impact of the Mixed model in this task. When the 
WAC model for English-French is replaced by the Mixed model, the effectiveness 
decreases. This shows once again that the coefficients we set for different models are 
not suitable for the CLEF data. 



6 Analysis of CLEF Results 

In analyzing the translation results, we observed several problems in query 
translation. 



6.1 Translation of Ambiguous Words 

The translation models we constructed are IBM Model 1 models. These models do not 
consider the context during translation. Translation is word by word; i.e. each 
word is translated in isolation. Therefore, the models cannot resolve word ambiguity in 
translation. For example, the word "drug" may be translated as "medicament" or 
"drogue" in French. These two senses are included in the translations of all the 
models, as we can see in Table 9. The same phenomenon occurs for the word 
"union" (in a query on the European Union), which is translated to both "union" and 
"syndicat" (Table 10). 

Table 9. Related translation words of "drug" 

Model      Translation      Prob. 
Hansard    medicament       0.1027 
           drogue           0.0464 
           stupefiant       0.0042 
WAC        drogue           0.0862 
           medicament       0.0692 
           drug             0.0042 
Termium    drogue           0.0889 
           medicament       0.0534 
           drug             0.0101 
           medicamenteux    0.0049 
           stupefiant       0.0046 
Mixed      drogue           0.0746 
           medicament       0.0715 
           stupefiant       0.0062 
           medicamenteux    0.0020 
           remede           0.0018 



Table 10. Related translation words of "union" 

Model      Translation      Prob. 
Hansard    syndicat         0.0781 
           communaute       0.0358 
           union            0.0323 
           collectivite     0.0125 
           syndical         0.0111 
           syndique         0.0042 
           cee              0.0036 
           unir             0.0032 
WAC        syndicat         0.0666 
           union            0.0508 
           syndical         0.0341 
           communaute       0.0158 
           ue               0.0153 
           collectivite     0.0131 
Termium    union            0.0961 
           syndicat         0.0327 
           communautaire    0.0146 
           assemblage       0.0049 
           assemblee        0.0133 
           syndical         0.0123 
           communaute       0.0094 
           collectivite     0.0067 
           community        0.0044 
Mixed      union            0.0673 
           syndicat         0.0546 
           communaute       0.0185 
           syndical         0.0167 
           communautaire    0.0131 
           collectivite     0.0098 
           assemblage       0.0060 
           assemblee        0.0055 
           ue               0.0037 



6.2 Translation of Compound Terms 

The example of "European Union" also shows the necessity to translate compound 
terms as a unit, instead of translating them word by word. When a compound 
term is translated as a whole, the word "union" in "European Union" is much less ambiguous than 
when it is translated in isolation. To do this, two approaches are possible. 1) One can 
detect compound terms in the parallel training texts before using the texts for model 
training. These compound terms will be considered as a "word" in IBM model 1. 
If this "word" appears in a query, then it is translated as a unit (possibly by a 
compound term in the target language). 2) One can also use a higher model than IBM 
model 1. In fact, in IBM model 1, words in a sentence are considered independently. In order to 
capture the relationship between words in a compound term, or to capture some 
contextual information, it would be useful to use at least a language model together 
with the translation model. That is: 



$$\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_l = \arg\max P(f_1, f_2, \ldots, f_l \mid E) = \arg\max \prod_{i=1}^{l} P(f_i \mid E) \cdot P(f_1, f_2, \ldots, f_l) \qquad (1)$$

where P(f_1, f_2, ..., f_l) is a language model that estimates the probability of f_1, f_2, ..., f_l 
appearing together in the target language, and P(f_i|E) is the translation model. 

In so doing, the best translation words would be those that not only have high 
translational probability, but also have a high probability to appear together in the 
target language. An alternative is to use IBM model 2 or 3 instead of model 1 . 
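
The idea behind equation (1) can be sketched as a small search over candidate translations. The sketch is only illustrative: it assumes one candidate list per source word, treats the language model as an arbitrary callable supplied by the caller, and enumerates all combinations exhaustively, which is practical only for short queries.

```python
from itertools import product

def best_translation(candidates, language_model):
    """Pick the word combination maximizing translation prob. times LM prob.

    candidates: list (one entry per source word) of dicts target -> P(target | E).
    language_model: callable taking a tuple of target words and returning
                    P(f_1, ..., f_l); hypothetical, supplied by the caller.
    """
    best_words, best_score = None, -1.0
    for combo in product(*(c.items() for c in candidates)):
        words = tuple(word for word, _ in combo)
        score = language_model(words)  # probability of the words appearing together
        for _, prob in combo:
            score *= prob              # product of the translation probabilities
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score
```

With a language model that prefers "europeen union" over "europeen syndicat", the combination containing "union" can win even when "syndicat" has the higher translation probability in isolation.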



6.3 The Effect of Mixing Models 

From the above translation examples, we cannot observe any advantage from 
combining different models. In fact, the mixed model only takes the 
translation words from the different models, and re-calculates their probabilities according 
to a linear combination. This does not affect the ambiguity problem. Ambiguous 
words remain as ambiguous as before the model combination. 

The parameters used for linear combination of models are estimated on a small set 
of held-out data that are not necessarily adapted to the IR document collection used in 
these experiments. A better way to train the parameters is to use a similar IR test 
collection (e.g. the collections used for the CLIR tracks at TREC). The combination 
with better tuned parameters could allow us to achieve higher effectiveness than 
single translation models. We can also think about a different combination method 
than linear combination, or a method for estimating combining coefficients that 
corrects for vocabulary-size bias. 



6.4 Coverage of the Models 

The effectiveness of each translation model may be strongly affected by its coverage. 
A model that produces many unknown words will not be able to translate many key 
concepts correctly (except for proper names). Table 6 showed the number of unknown 
words when translating queries to English from different languages. The G-E and I-E 
cases are comparable in both the size of training corpora and types of pre-processing 
on Italian and German. Nevertheless, there was a large difference between the G-E run 
and the I-E run. A strong factor that may have affected these performances is unknown 
words. Below we show the case of a query on "electroweak theory"¹. All the words 
marked * are unknown words. 



¹ Although this query does not contribute to official effectiveness measurements (because 
there is no relevant document in the English and French collections), it does show the 
potential problem that low coverage may cause. 

Table 11. Word coverage in translating the query on "electroweak theory" 

Model      Known words                 Top translation words 
Hansard    *electroweak                theorie           0.0505 
           *subnuclear                 nucleaire         0.0497 
           *weinberg-salam-glashow     decouverte        0.0326 
           *subatomic *quark           confirmer         0.0323 
           *photon                     proposer          0.0231 
                                       domaine           0.0223 
                                       modele            0.0184 
                                       physique          0.0182 
WAC        *electroweak                theorie           0.0683 
           *subnuclear                 physique          0.0378 
           *weinberg-salam-glashow     nucleaire         0.0353 
           *subatomic *photon          decouverte        0.0281 
                                       domaine           0.0260 
                                       modele            0.0243 
                                       proposer          0.0238 
                                       confirmer         0.0238 
Termium    *weinberg-salam-glashow     theorie           0.0864 
                                       electrofaible     0.0605 
                                       nucleaire         0.0548 
                                       physique          0.0457 
                                       particule         0.0368 
                                       infra-atomique    0.0303 
                                       quark             0.0303 
                                       modele            0.0281 
Mixed      *weinberg-salam-glashow     theorie           0.0740 
                                       nucleaire         0.0505 
                                       physique          0.0336 
                                       decouverte        0.0298 
                                       electrofaible     0.0250 
                                       particule         0.0239 
                                       modele            0.0237 
                                       confirmer         0.0229 



As we can see, most unknown words are key concepts of the query. The Hansard 
model seems to have the worst coverage for this query. Most of the key concepts are 
unknown. The model based on parallel web pages is slightly better. The Termium 
lexical database is the best for this query. It recognizes all the concepts, except the 
proper names. For this particular query, the effectiveness would have been greatly 
affected by the coverage of the models if its effectiveness were measured. 

For other queries, there are only a few unknown words. In the English to French 
case, the Hansard model encountered 17 unknown words in total, the WAC model 13 
and the Termium model 14. This appears surprising for the WAC model, which is a 
resource constructed without manual control. This shows that an automatically 
constructed parallel corpus can have a very good coverage. 

7 Final Remarks 

In this CLEF, we successfully used parallel Web pages to train several translation 
models for language pairs other than English and French. Our experiments on mining 
the web for parallel texts further confirm that the automatic mining approach is 
feasible for many language pairs. 

For monolingual IR, we used some basic IR methods, including simple stemmers 
and publicly available stoplists. The effectiveness for French and Italian monolingual 
IR is similar to the medium performance. The German monolingual run is well below 
the medium. We think the main reason is that we did not carry out any particular 
processing on German morphology, which is an important problem for German IR. 

For bilingual IR between English and French, and between English and Italian, the 
effectiveness seems to be reasonable. It is better than the medium effectiveness. 
Between English and German, however, the effectiveness is well below the medium 
effectiveness. The reason may be the same as for the German monolingual run. 

For multilingual runs, the performance is below the medium. We believe the 
reason is once again the low effectiveness for German. In addition, result merging 
may also have affected the global effectiveness. 

Between English and French, we also tried to combine different resources in our 
translation models. We used a linear combination of the models trained with different 
data, including two dictionaries, a manually constructed parallel corpus, an 
automatically constructed parallel corpus and a lexical database. The coefficients of 
the combination were determined using a small set of held-out data. However, to our 
surprise, the mixed model performed worse than the model trained with the Web 
documents only. In fact, its effectiveness is lower than any of the individual 
translation models. This clearly indicates that the combination is not well suited to the 
CLEF data. In other words, the held-out data do not correspond to the document 
collection used in these IR experiments. 

These experiments reveal several problems in using statistical translation models 
for CLIR. 1) The IBM model 1 has difficulty translating ambiguous words correctly. 
In order to deal with this problem, we will need to take into account a language model 
or use a more elaborate translation model in the future. 2) Compound terms should be 
translated as a whole, instead of being decomposed into single words. 3) Models 
should be combined in a better way. These are some of the problems we will study in 
our future research. 

Abstract. This paper presents the experiments undertaken by the IRIT 
team in the multilingual, bilingual and monolingual tasks of the CLEF 
evaluation campaign. Our approach is based on query translation. The 
queries were translated using free dictionaries and then disambiguated 
using an aligned corpus. The experiments were done using our 
connectionist system Mercure. 



1 Introduction 

The goal of Cross Language Information Retrieval (CLIR) is to retrieve docu- 
ments from a pool of documents written in different languages in response to a 
user’s query written in one language. Thus in CLIR, the initial query is in one 
language and the document can be in another. 

CLIR is mainly based on information translation. Different approaches P] have 
been considered in the literature: machine translation, machine-readable 
dictionaries and corpus-based approaches. These techniques can be used to translate 
either the query terms or the document terms. 

The main problems of CLIR are: finding the possible translations of a term, and 
deciding which of the possible translations should be retained (the disambiguation 
problem). The paper presents our experiments at CLEF1: multilingual, 
bilingual and monolingual. Our approach to CLIR is based on query translation 
using dictionaries. 

In the multilingual experiment, two merging techniques were tested: a naive 
strategy and a normalised strategy. In the bilingual experiment a dictionary is 
used to translate the queries from French to English and a disambiguation tech- 
nique based on the query context is then applied to select the best terms from 
the (translated) target queries. 

All these experiments were done using the Mercure system, which is presented 
in Section 2 of this paper. Section 3 describes our general CLIR methodology, 
and finally, Section 4 describes our experiments and the results obtained at 
CLEF1. 



C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 202- 1™! 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 



Mercure at CLEF-1 






2 Mercure System 

2.1 Description 

Mercure is an information retrieval system based on a connectionist approach 
and modelled by a multi-layered network. The network is composed of a query 
layer (set of query terms), a term layer representing the indexing terms and a 
document layer. 

Mercure includes the implementation of a retrieval process based on spreading 
activation forward and backward through the weighted links. Queries and doc- 
uments can be either inputs or outputs of the network. The links between two 
layers are symmetric and their weights are based on the tf * idf measure inspired 
by the OKAPI [5] term weighting formula. 

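— the term-document link weights are expressed by: 

$$w_{ij} = \frac{tf_{ij} \cdot \left(h_1 + h_2 \cdot \log\frac{N}{n_i}\right)}{h_3 + h_4 \cdot \frac{dl_j}{\Delta d} + h_5 \cdot tf_{ij}} \qquad (1)$$

— the query-term (at stage s) links are weighted as follows: 

$$q_{ui}^{(s)} = \begin{cases} \ldots & \\ qtf_{ui} & \text{otherwise} \end{cases} \qquad (2)$$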



2.2 Query Evaluation 

A query is evaluated using the spreading activation process described as follows: 

1. The query is the input of the network. Each node from the term layer 
computes an input value from this initial query: In(t_i) = q_ui, and then an 
activation value: Out(t_i) = g(In(t_i)), where g is the identity function. 

2. These signals are propagated forwards through the network from the term 
layer to the document layer. Each document node computes an input: 
In(d_j) = Σ_{i=1..T} Out(t_i) * w_ij, and then an activation: Out(d_j) = RSV(Q_u, d_j) = 
g(In(d_j)). 

Notations: 

T: the total number of indexing terms, 
N: the total number of documents, 
q_ui: the weight of the term t_i in the query u, 
t_i: the term t_i, 
d_j: the document d_j, 
w_ij: the weight of the link between the term t_i and the document d_j, 
dl_j: document length in words (without stop words), 
Δd: average document length, 
tf_ij: the term frequency of t_i in the document d_j, 
n_i: the number of documents containing term t_i, 
nq_u: the query length (number of unique terms), 
qtf_ui: query term frequency. 
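
The term-document weight of equation (1) and the forward propagation step can be sketched as below. The values of the constants h1 to h5 are not given in the text, so any values passed in are hypothetical; the natural logarithm and the dictionary-based index layout are also assumptions of this sketch.

```python
import math

def link_weight(tf, n_docs, doc_freq, doc_len, avg_doc_len, h):
    """Term-document link weight of equation (1); h = (h1, h2, h3, h4, h5)."""
    h1, h2, h3, h4, h5 = h
    numerator = tf * (h1 + h2 * math.log(n_docs / doc_freq))
    denominator = h3 + h4 * (doc_len / avg_doc_len) + h5 * tf
    return numerator / denominator

def evaluate_query(query_weights, doc_weights):
    """Forward spreading activation with g as the identity function.

    query_weights: dict term -> q_ui.
    doc_weights: dict doc_id -> dict term -> w_ij.
    Returns (doc_id, RSV) pairs sorted by decreasing RSV.
    """
    rsv = {}
    for doc_id, weights in doc_weights.items():
        # In(d_j) = sum over terms of Out(t_i) * w_ij
        rsv[doc_id] = sum(q * weights.get(term, 0.0)
                          for term, q in query_weights.items())
    return sorted(rsv.items(), key=lambda item: item[1], reverse=True)
```

Because g is the identity, the retrieval status value reduces to an inner product between the query weights and the link weights of each document.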

3 General CLIR Methodology 

Our CLIR approach is based on query translation. It is illustrated by Fig. 1. 



Fig. 1. General CLIR approach 



— Indexing: a separate index is built for the documents in each language. 
English words are stemmed using the Porter algorithm, French words are 
stemmed by truncation (first 7 characters), and no stemming is applied to 
German and Italian words. The German and Italian stoplists were downloaded 
from the Internet. 

— Translation: this step is based on “dictionaries”. For the CLEF1 experiments, four 
bilingual dictionaries were used, all of which were actually simply a list of 
terms in language l1 that were paired with some equivalent terms in language 
l2. Table 1 shows the source and the number of entries in each dictionary. 

— Disambiguation: when multiple translations exist for a given term, they are 
generally relevant only in a specific context. Disambiguation consists of 
selecting the terms that are in the context of the query. We consider that 
the context of a given query can be represented by the list of its terms. The 
disambiguation process consists of building a context for the target query and 
using this context to disambiguate the list of substitutions resulting from the 
translation of the source query. 

A context of the target query is built using an aligned corpus. It consists 
of selecting the best terms appearing in the top (X=12) documents in the 
target language aligned to the top (X=12) documents retrieved by the source query. 






The terms are sorted according to the following formula: 

$$score(t_i) = \sum_{d_k \in D_X} d_{ik}$$

D_X: the set of documents aligned to those retrieved by the source query, 
d_ik: the weight of term t_i in document d_k. 

The disambiguation of the translated query consists of retaining only terms 
that appear in the list of terms of the target context. However, if a specific 
term has a unique substitution, this term is retained even if it does 
not exist in the context of the target query. Note that in this process all the 
terms appearing in the target context are retained; we do not select only the 
best translation as is done in some other studies. 
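
The disambiguation procedure described above can be sketched as follows. The function names, the dictionary-based document representation and the `context_size` parameter (how many "best terms" form the context) are assumptions made for illustration; the text does not state how many terms are kept.

```python
def build_target_context(aligned_docs, context_size):
    """Build the target context from the aligned documents.

    aligned_docs: list of dicts term -> weight d_ik, one per target-language
                  document aligned to a top document of the source query.
    """
    scores = {}
    for doc in aligned_docs:
        for term, weight in doc.items():
            scores[term] = scores.get(term, 0.0) + weight  # score(t_i) = sum of d_ik
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:context_size])

def disambiguate(translations, context):
    """Filter dictionary translations with the target context.

    translations: dict source term -> list of candidate target terms.
    """
    target_query = []
    for source, candidates in translations.items():
        if len(candidates) == 1:
            # A unique substitution is retained even if it is not in the context
            target_query.extend(candidates)
        else:
            # All candidates found in the context are retained, not only the best
            target_query.extend(t for t in candidates if t in context)
    return target_query
```

Note that an ambiguous term whose candidates are all absent from the context contributes nothing to the target query in this sketch.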



Table 1. Dictionary characteristics 

Type    Source                      nb. entries 
E2F     http://www.freedict.com     42443 
E2G     http://www.freedict.com     87951 
E2I     http://www.freedict.com     13478 
F2E     http://www.freedict.com     35200 



4 Experiments and Results 

4.1 Multilingual Experiment 

Two runs using English topics and retrieving documents from the pool of docu- 
ments in all four languages (German, French, Italian and English) were submit- 
ted. The queries were translated using the downloaded dictionaries. There was 
no disambiguation; all the translated words were retained in the target queries. 
The runs were performed by doing individual runs for each language pair and merg- 
ing the results to form the final ranked list. Two merging strategies were tested: 



— naive strategy: all the documents resulting from the bilingual searches are 
entered in a final list. These documents are then sorted according to their 
RSV. The top 1000 were submitted. 

— normalised strategy : each list of retrieved documents resulting from the 
bilingual searches was normalised. The normalisation consists simply of di- 
viding the RSV of each document by the maximum of RSVs in that list. The 
documents of the different lists are then merged and sorted according to their 
normalised RSV. The final list corresponds to the top 1000 documents. 
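
The two merging strategies can be sketched in a few lines. The function name and the list-of-pairs representation of a run are assumptions of this sketch.

```python
def merge_runs(runs, normalise=False, cutoff=1000):
    """Merge per-language result lists into one ranked list.

    runs: list of ranked lists of (doc_id, rsv), one list per language pair.
    normalise=False gives the naive strategy; normalise=True divides each RSV
    by the maximum RSV of its own list before merging.
    """
    merged = []
    for run in runs:
        if not run:
            continue
        top = max(rsv for _, rsv in run)
        for doc_id, rsv in run:
            if normalise and top > 0:
                rsv = rsv / top
            merged.append((doc_id, rsv))
    # Sort all documents by (possibly normalised) RSV and keep the top ones
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:cutoff]
```

With the naive strategy a list whose scores are on a larger scale can crowd out the others, while normalisation gives the top document of every list the same score of 1.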

Table 2. Comparison with median at average precision 

                                   irit1men2a      irit2men2a 
better than median Avg. Prec.      15 (best 0)     16 (best 0) 
worse than median at Avg. Prec.    25 (worst 2)    24 (worst 1) 

Two runs were submitted: irit1men2a based on normalised merging and 
irit2men2a based on naive merging. 

Table 2 compares our runs against the published median runs. We note that 
for both runs the number of topics above and below the median are fairly similar. 



Table 3. Comparisons between the merging strategies 

Run-Id        P5        P10       P15       P30       Exact     Avg. Prec. 
irit1men2a    0.3750    0.3250    0.2900    0.2433    0.1996    0.1519 
irit2men2a    0.3950    0.3400    0.3017    0.2500    0.2284    0.1545 



Table 3 compares the merging strategies. It can be seen that the naive strat- 
egy is slightly better than the normalised strategy at the top documents and at 
exact precision, but there is almost no difference at average precision. Nothing was gained from 
the normalised strategy. 



Table 4. Comparison with median at average precision 

Language pair       P5        P10       P15       P30       Exact     Avg. Prec. 
E2F (34 queries)    0.2941    0.2118    0.1824    0.1353    0.2185    0.2046 
E2G (37 queries)    0.2378    0.2189    0.1910    0.1396    0.1683    0.1489 
E2I (34 queries)    0.1882    0.1647    0.1333    0.0843    0.1877    0.1891 
E2E (33 queries)    0.5091    0.4212    0.3677    0.2798    0.4490    0.4611 



Table 4 shows the results per language pair (for example, E2F means English 
queries translated to French and compared to French documents). We can 
easily see that the monolingual (E2E) search performs much better than all the 
bilingual (E2F, E2G, E2I) searches. Moreover, all the bilingual searches (except 
E2G) have a better average precision than the best multilingual search. The 
merging strategy adopted caused the loss of relevant documents. Table 5 shows 
the total number of relevant documents in the bilingual lists, the number of 
documents which were kept in the final list, and the number lost when merging. Relevant 
documents were lost from all the bilingual lists. 



Table 5. Comparison between the number of relevant documents in Bilingual 
and Multilingual lists 

                               E2E    E2F    E2I    E2G 
Rel. Ret. by bilingual list    554    389    228    467 
Rel. kept in the final list    500    281    152    296 
Rel. lost                      54     107    76     171 



4.2 Bilingual Experiment 

The bilingual experiment was carried out using an F2E free dictionary + 
disambiguation. The disambiguation was performed using the WAC (World-wide-web 
Aligned Corpus) parallel corpus built by the RALI Lab 
(http://www-rali.iro.umontreal.ca/wac/). 



Table 6. Comparative bilingual F2E results at average precision 

                                   irit1bfr2en 
better than median Avg. Prec.      22 (best 3) 
worse than median at Avg. Prec.    11 (worst 2) 

Table 6 compares our run against the published median runs. Most queries 
give results better than the median and 3 were the best. 

Table 7 presents the disambiguated queries. We note that of 33 queries, 13 
have been disambiguated. We note that 10 of these queries have improved their 
average precision, and the total number of relevant documents retrieved has grown from 
371 to 399. 

Table 8 compares the results between the runs Dico+disambiguation and 
Dico only. The disambiguation is shown to be effective as the average precision 
improves by 6%. 

4.3 Monolingual Experiments 

Three runs were submitted for the monolingual tasks: iritmonofr, iritmonoit and
iritmonoge.
Table 7. Impact of the disambiguation based on aligned corpus

                          Dico       Dico+Disambiguation
Total of Relevant Doc.    371/579    399/579

Queries   Avg.Prec   Avg.Prec.   Impr.(%)
1         0.6420     0.6420      0%
5         0.0041     0.0528      1187.8%
13        0.1453     0.1486      2.27%
14        0.1218     0.1218      0%
16        0.5775     0.5769      -0.10%
17        0.6077     0.6274      3.24%
18        0.0014     0.0398      2742.86%
19        0.7365     0.7791      5.78%
24        0.3101     0.3293      6.19%
28        0.0191     0.0387      102.6%
29        0.5833     0.5909      1.30%
31        0.0020     0.0021      5%
33        0.0395     0.0664      68.10%



Table 8. Impact of the disambiguation

Run-id (33 queries)   P5       P10      P15      P30      Exact    Avg.Prec
Dico+Dis.             0.3152   0.2636   0.2182   0.1636   0.2841   0.2906
Dico                  0.2788   0.2515   0.2000   0.1566   0.2685   0.2741
Impr (%)              13       4.8      9        4.5      5.8      6



Table 9. Comparison between monolingual searches

Run-id (33 queries)          P5       P10      P15      P30      Exact    Avg. Prec.
iritmonofr FR (34 queries)   0.4765   0.4000   0.3510   0.2637   0.4422   0.4523
iritmonoit IT (34 queries)   0.4412   0.3324   0.2490   0.1637   0.4182   0.4198
iritmonoge GE (37 queries)   0.4108   0.3892   0.3550   0.2766   0.3197   0.3281
Table 9 shows that the French monolingual results seem to be better than both
the Italian and the German ones. Italian results are better than German. These
runs were done using exactly the same procedures; the only difference concerns
the stemming, which was used only for French. We also notice clearly that the
monolingual search is much better than both the multilingual and the bilingual
searches.

5 Conclusion 

In this paper we have presented our experiments for CLIR at CLEF.

In multilingual IR, we showed that our merging strategies caused the loss of
relevant documents. In bilingual IR, we showed that the disambiguation technique
for translated queries is effective. Results of the experiments have also shown
that it is feasible to use free dictionaries, and that disambiguation based on
an aligned corpus gives good results even though the documents of the aligned
corpus are independent of those of the database.

In our future work, we will try to find a way to solve the problem of merging. 
Abstract. We designed, implemented and evaluated an automated method for
query construction for CLIR from Finnish, Swedish and German to English.
This method seeks to automatically extract topical information from request
sentences written in one of the source languages and to create a target language
query, based on translations given by a translation dictionary. We paid
particular attention to morphology, compound words and query structure, and
tested this approach in the bilingual track of CLEF. All the source languages
are compound languages, i.e., languages rich in compound words. A compound 
word refers to a multi-word expression where the component words are written 
together. Because source language request words may appear in various 
inflected forms not included in a translation dictionary, morphological 
normalization was used to aid dictionary translation. The query resulting from 
this process may be structured according to the translation alternatives of each 
source language word or remain as an unstructured word list. 



1 Introduction 

NLP techniques have been tested for IR and CLIR for several years. The point of view
has been that linguistically motivated database indexing and query construction would
enable the catching of sense in text and in queries differently from the non-linguistic
methods used in IR, for example weighting based on word occurrence statistics.
Traditional NLP techniques have also been extended to the sub-word level, i.e.,
morphological decomposition and stemming [1]. So far, great success in increasing the
quality of retrieval results due to these techniques has not been reported, compared to
statistical methods. In CLIR, the use of NLP techniques is almost a necessity because
one is dealing with languages which are morphologically more complex than English.

One of the main approaches to CLIR is based on bilingual translation dictionaries.
For an overview of the main approaches, see [2], [3], [4], [5]. In this paper, we adopt a
dictionary-based approach to CLIR. The main problems associated with such an
approach are 1) phrase identification and translation, 2) source language ambiguity, 3)
translation ambiguity, 4) the coverage of dictionaries, 5) the processing of inflected
words, and 6) untranslatable keys, in particular proper names spelled differently in
different languages [6].

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 210-223, 2001.

© Springer-Verlag Berlin Heidelberg 2001

Our approach to solve the general problems for bilingual CLIR is based on 1) word 
form normalization in indexing, 2) stopword lists, 3) normalization of topic word 
forms, 4) splitting of compounds, 5) recognition of proper components of compounds, 
6) phrase composition in target language, 7) bilingual dictionaries, and 8) structured 
queries. 

All the source languages we use, Swedish, Finnish and German, are languages rich
in compounds. It is therefore essential to develop techniques for the processing of
compounds. Our second interest is to compare structured and unstructured queries as
a way to solve the ambiguity problem in CLIR. We used a model for query structuring
developed and tested for Finnish - English CLIR by Pirkola [7].



2 Research Questions 

The research questions are: 

1. By what process, using bilingual dictionaries, can we automatically construct 
effective target language queries from source language request sentences? 

2. How does retrieval effectiveness vary when source languages vary? 

3. How does query structure affect CLIR effectiveness when using different source 
languages? 

The first research question involves designing and implementing our approach to 
automated bilingual query construction using generally available bilingual 
dictionaries. The method seeks to automatically extract topical information from 
search topics in one of the source languages and to automatically create a target 
language query. The resulting query may either be structured or unstructured. We will 
compare the effectiveness of structured and unstructured queries. 

Our tests for the second research question include three different language pairs, 
Finnish, Swedish and German as source languages and English as the target language 
(for short FIN | SWE | GER -> ENG CLIR). We have tested the use and effects of 
morphological analysis programs, dictionary set-ups and translation approaches. All 
the source languages are rich in compounds, and thus, one of our main efforts is the 
morphological decomposition of compounds into constituents and their proper 
translation. In languages rich in compounds, the right translation of compounds (or 
their components) is a factor that greatly affects the retrieval results. 

Homographic word forms, especially as components in compounds, tend to add
many translation alternatives to a query. Our method for treating compounds
combines every translation alternative for each component into a phrase. Therefore, a
great number of translation alternatives produces an excessive number of
combinations. A rich inflectional morphology (in Finnish) is also a factor that affects the
retrieval result, particularly when trying to identify and handle proper names.
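
The growth in the number of phrases can be illustrated with a small sketch. It assumes that the combination step is a plain Cartesian product over the translation alternatives of each component; the function name and the toy alternatives are hypothetical and serve only to show the mechanism.

```python
from itertools import product


def compound_phrases(alternatives):
    """Combine every translation alternative of each component into a phrase.

    `alternatives` holds one list of target-language candidates per component.
    """
    return [" ".join(combo) for combo in product(*alternatives)]


# Toy alternatives for a three-component compound (invented for illustration).
alternatives = [["article", "goods"], ["building", "house"], ["ceiling", "roof"]]
phrases = compound_phrases(alternatives)

print(len(phrases))  # product of the list lengths: 2 * 2 * 2
for phrase in phrases:
    print(phrase)
```

With only two alternatives per component a three-part compound already yields eight phrases, so a single homographic component with many senses multiplies the query length accordingly.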

The third research question involves constructing both structured and unstructured 
queries for all language pairs and testing their effectiveness. Query structure is the 
syntactic structure of a query expression, as expressed by the query operators and 
parentheses. The structure of queries may be described as weak (queries with a single
operator or no operator, no differentiated relations between search keys) or strong
(queries with several operators, different relationships between search keys) [7], [9].
In this study, queries with a single operator and no differentiated relations between
search keys are called unstructured queries, and queries with synonym relations
between search keys translated from the same source language word are called
structured queries.

Turid Hedlund et al.



3 Research Settings 



3.1 Document Collection and Test Topics 

The LA Times document database was indexed as the document collection. Our approach
for database indexing in the target language is based on word form normalization, 
using the morphological analysis program ENGTWOL. We allow ambiguity (e.g. 
multiple base forms for a word) and language inconsistency (e.g., seat belt, seat-belt, 
seatbelt) in the text. Unrecognized word forms could not be normalized and were thus 
labeled as such (e.g., proper names were specially marked as unrecognized). 

The CLEF test topics include a title, a description and a narrative. For CLIR purposes and
automated query construction, it seems favorable to keep the test requests relatively
short, about 2-3 sentences. Therefore we automatically selected the title and description
fields only. We used the Finnish, Swedish and German test topics.



3.2 The Query Construction Processes 

Our approach in the query formulation process in the source languages included word 
form normalization, the removal of source language stopwords, and compound 
splitting into proper components in their base forms for recognition in dictionaries. 
This meant, e.g., handling of fogemorphemes in Swedish and German, and inflection
in Finnish. Fogemorphemes are morphemes joining constituents in compounds, e.g.,
the “s” in the word rättsfall (legal case). We applied phrase construction in the target
language for the compounds in the source languages and labeled unrecognized word 
forms (e.g., proper names) as done in the indexing phase. The unrecognized word 
forms were used as such, disregarding possible inflection. In all these phases we 
allowed ambiguity, i.e. multiple possible interpretations for the source language word 
forms. The translation is structured using the synonym set structure [7] to reduce 
ambiguity effects. The synonym sets were the target language word sets as given by 
the bilingual dictionaries, for each source language word. 
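
The steps above can be sketched as a small pipeline. This is a simplified illustration, not the authors' implementation: the lookup table `base_forms` is a hypothetical stand-in for a morphological analyzer, compound splitting is left out, and the toy data are invented (only the translation alternatives for möte come from the text of this paper).

```python
def build_synonym_sets(topic_words, stopwords, base_forms, dictionary):
    """Return one list of target-language alternatives per retained source word."""
    synonym_sets = []
    for word in topic_words:
        # Normalize the word form; unknown forms fall back to the lowercased word.
        lemma = base_forms.get(word.lower(), word.lower())
        if lemma in stopwords:
            continue  # source-language stopwords are dropped
        # Words missing from the dictionary (e.g. proper names) are used as such.
        synonym_sets.append(dictionary.get(lemma, [word]))
    return synonym_sets


# Toy Swedish-English resources, for illustration only.
stopwords = {"i"}
base_forms = {"möten": "möte"}
dictionary = {"möte": ["encounter", "meeting", "crossing", "appointment", "date"]}

sets = build_synonym_sets(["möten", "i", "Nice"], stopwords, base_forms, dictionary)
print(sets)
```

Each inner list corresponds to one synonym set; a word without a dictionary entry forms a set of its own and is passed through unchanged.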
Fig. 1. (a), (b) General description of the automatic query construction process



The automatic query construction process takes the following 5 resources as inputs: 

1. the CLEF topic file in one source language (SWE, FIN, GER) 

2. a file, or files containing stopwords in the source language 

3. a file containing stopwords in the target language (ENG) 

4. a bilingual translation dictionary for the language pair 

5. a morphological analysis program for the source language. 
As there are slight differences between the language pairs used, we describe the
processing of each language pair individually in the following.

The structured Swedish-English query process uses the following five inputs: the
Swedish CLEF topic file, the Swedish stop word file, the English stop word file, the
SWETWOL morphological analyzer for Swedish and the Motcom Swedish-English 
translation dictionary (60.000 words). The Motcom dictionary’s output contains a lot 
of information intended for a human reader. The actual translations were obtained 
from the Motcom dictionary by a filtering script. 

The structured German-English query process uses the following five inputs: the
German CLEF topic file, the German stop word file, the English stop word file, the
Duden German-English translation table for the 40 CLEF topics, and the GERTWOL
morphological analyzer software for German. The construction of the German-English
translation table was a separate process accomplished by a human analyzer following
strict syntactic rules for selecting strings from the PC screen. As the dictionary system,
the Oxford Duden German dictionary (260.000 words), did not allow use through a
program interface, and because the selection of the strings had to be based on the font
color, this process could not be automated. However, the translation table was used
automatically.

The structured Finnish-English query process uses the following five inputs: the
Finnish CLEF topic file, the Finnish stop word file, the Motcom Finnish-English 
translation dictionary (110.000 words), and the morphological analyzer FINTWOL 
for Finnish. The translation program was modified from the program code of the 
structured German-English query translation. Finnish-English word-by-word 
translations were generated by using a command line interface to the Finnish-English 
Motcom translation dictionary. A filtering script produced in most cases a “clean” 
stream of individual words or phrases as English translation equivalents for each 
Finnish word. 



The unstructured German-English query (official CLEF run) was a simple 
modification of the corresponding structured German-English process, only removing 
structure from the structured query versions. The unstructured Finnish - English query 
process and the unstructured Swedish - English query process (both unofficial runs) 
were constructed in the same way as the German unstructured query. 



3.3 Compound Splitting 

For Swedish, Finnish and German, compound splitting and the translation of
constituents were performed. If a compound is lexicalised and found in the
machine-readable dictionary used, this translation is probably less ambiguous than
translating the constituents and is therefore used. For all other compounds, compound
splitting is performed. Compounds in Swedish need special treatment, since our earlier
tests [8] indicated that the morphological analyzer for Swedish needs tuning to give
proper results for IR purposes. To solve this problem we developed an algorithm which
seeks to turn all the constituents of a compound into their lexical base forms, which
should be real words and not stems. In the case of German, nouns as constituents need
to get an upper-case initial letter. We also removed one common fogemorpheme in
German, namely the “s”. Proper names and other words not found in the dictionary are
added to the query as such. The process for handling compounds is described in Fig. 2.

Translation of compounds

självmord
    suicide

soltempelorden -> sol#tempel#orden, sol#tempel#ord
    #4(sun temple order)
    #4(sun temple word)

landningsbanan -> landnings#bana, landnings#banan (landning)
    #3(landing path)
    #3(landing track)
    #3(landing line)
    #3(landing orbit)
    #3(landing course)
    #3(landing banana)

världshandelsorganisationen -> världs#handels#organisation (värld, handel)
    #4(world trade organization)
    #4(world commerce organization)
    #4(world business organization)
    #4(world part organization)
    #4(world section organization)
    #4(world area organization)
    #4(world region organization)
    #4(world volume organization)

varuhustak -> varu#hus#tak (vara)
    #4(article building ceiling)
    #4(article building roof)
    #4(article house ceiling)
    #4(goods building roof)
    #4(goods building ceiling)
    #4(goods house roof)
    #4(goods house ceiling)

Fig. 2. (a), (b). Description of the process for handling compound translation:
(a) the process, (b) examples
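
As an illustration of compound splitting with a joining “s”, the following sketch tries to decompose a word into components found in a lexicon. It is a simplified stand-in, not the algorithm used in the experiments: the exhaustive left-to-right search, the function name and the toy lexicon are all assumptions.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Split `word` into lexicon components, allowing a joining morpheme.

    Returns the list of components, or None if no complete split exists.
    A word found in the lexicon as a whole is kept unsplit.
    """
    if word in lexicon:
        return [word]
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        for linker in linkers:
            if linker and not head.endswith(linker):
                continue
            # Strip the joining morpheme (if any) to get the base form.
            stem = head[: len(head) - len(linker)] if linker else head
            if stem in lexicon:
                tail = split_compound(rest, lexicon, linkers)
                if tail is not None:
                    return [stem] + tail
    return None


# Toy lexicon of base forms, for illustration only.
lexicon = {"rätt", "fall", "landning", "bana"}
print(split_compound("rättsfall", lexicon))
print(split_compound("landningsbana", lexicon))
```

A word that cannot be fully decomposed returns None, which corresponds to the case where the word is added to the query as such.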



3.4 Query Structuring 

Query structuring was done by using the syn operator provided in the InQuery
retrieval software. Every translation alternative for a word in the translation dictionary
is added to the query as a synonym. The Synonym operator's syntax is #syn(T1 ... Tn),
where Ti (1 <= i <= n) is a term. The terms within this operator are treated as instances of
the same term for belief score computation. For example, the translation of the word
möte becomes #syn(encounter meeting crossing appointment date). A compound in
the source language that is translated by a dictionary as a phrase needs to be marked
with a proximity operator. The Ordered Distance operator's syntax is #N(T1 ... Tn) or
#odN(T1 ... Tn), where N is the distance and Ti (1 <= i <= n) is a term. The terms within an
ordered distance operator must be found within N words of each other in the text in
order to contribute to the document's belief score. The #N version is an abbreviation
of #odN; therefore #3(health care) is equivalent to #od3(health care).

The Weighted Sum operator's syntax is #wsum(Ws W1 T1 ... Wn Tn), where Ws is
the query weight and Wi (1 <= i <= n) is the term weight for the term Ti. The terms are
scored according to their weights in addition to their occurrence statistics. The final
belief score is scaled by Ws, the weight associated with the #wsum itself. For example,
#wsum(1 1 architecture 2 Berlin) weights Berlin twice as heavily as architecture.
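
The operator syntax described in this section can be generated with simple string formatting. The helpers below are a hypothetical sketch that only reproduces the textual form of the three operators as given above; they do not call InQuery and make no claim about its programming interface.

```python
def syn(terms):
    """Render a synonym set, e.g. #syn(T1 ... Tn)."""
    return "#syn(" + " ".join(terms) + ")"


def ordered_distance(n, terms):
    """Render the abbreviated ordered distance operator, e.g. #3(health care)."""
    return f"#{n}(" + " ".join(terms) + ")"


def wsum(query_weight, weighted_terms):
    """Render a weighted sum from (weight, term) pairs."""
    body = " ".join(f"{weight} {term}" for weight, term in weighted_terms)
    return f"#wsum({query_weight} {body})"


print(syn(["encounter", "meeting", "crossing", "appointment", "date"]))
print(ordered_distance(3, ["health", "care"]))
print(wsum(1, [(1, "architecture"), (2, "Berlin")]))
```

The three printed strings match the examples quoted in this section, which makes the helpers easy to check against the operator definitions.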



4 Analysis 

We shall first discuss some of the problems in the query formulation process and then 
present the evaluation results. 



4.1 Analysis of the Problems in the Query Formulation Process 

Major problems in our approach relate to matching, proper names, and semantics. In
addition, we identified some language-specific problems.

Matching problems: 

One of the major problems was matching the translation output to the database index. 

- proper names, although correctly translated, do not match the index words in the
document database; e.g., the form “USA” or “usa” is not recognized by the
morphological analysis program ENGTWOL for English.

- words translated to English by a dictionary can be in inflected form. For example,
the query words “taking” and “drugs” never matched any index words of the LA
Times database. The reason for this is that the ENGTWOL program used in the
index building process produced word forms “take” and “drug”, respectively, in the
index of the database.

Both these problems are solved if we run the dictionary translation through the
morphological analyzer, thus normalizing all recognized word forms in the same way
as they appear in the document database index. Unrecognized word forms in the
translation are labeled in the same way as words in the index.
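
The remedy amounts to applying one and the same normalization function at indexing time and at query time. The sketch below illustrates this under stated assumptions: the lookup table stands in for a morphological analyzer, and the "@" prefix used to label unrecognized forms is a hypothetical convention, not the label used in the experiments.

```python
def normalize(term, base_forms, unknown_marker="@"):
    """Map a word form to its base form, or label it as unrecognized."""
    key = term.lower()
    if key in base_forms:
        return base_forms[key]
    return unknown_marker + key


# Toy analyzer output, for illustration only.
base_forms = {"taking": "take", "take": "take", "drugs": "drug", "drug": "drug"}

# The same function is applied to document text and to translated query words.
index_terms = {normalize(w, base_forms) for w in ["take", "drug", "USA"]}
query_terms = [normalize(w, base_forms) for w in ["taking", "drugs", "usa"]]

print(query_terms)
print(all(term in index_terms for term in query_terms))
```

Because both sides pass through the same function, inflected translations and unrecognized names end up in the form under which they were indexed.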

Proper names: 

Proper names are difficult to translate, because they normally do not appear as entries
in dictionaries except for common geographical names. Still, there are differences in
spelling and variations in forms in different languages, e.g., Nice - Nizza. Proper names
in inflected forms are not normally recognized by the morphological analyzers, and
this makes normalization to base form impossible.

Semantic problems: 

Our test queries show a great variation in length. In general the Swedish - English
queries are shorter and the Finnish - English and German - English queries are
considerably longer. For Swedish - English we have an average query length of 29
words, for Finnish - English the average query length is 55 words and for German -
English queries 68 words. Seven of the German - English queries are over 100 words, and
some of them are extremely long, up to 528 words. However, the performance of a query
cannot be directly related to its length. Table 1 gives an overview of query length for
each language. The length of the query depends on:

- dictionaries, and the number of translation alternatives for a word. 

- compound words in the source language. When splitting compounds into three or 
four constituents the number of translation alternatives and their combinations grow 
rapidly. 

- homographic words with many senses. Frequent words not in the stop list of the 
source language tend to have many senses, and they also tend to appear as 
constituents in compound words. 



Table 1. Overview of query length in the target language, for all source languages

Query length            Number of queries
in words (n)      Swe-Eng   Ger-Eng   Fin-Eng
n<=10                   7         2         0
10<n<=20               14         7         4
20<n<=30                6         8         3
30<n<=50                2         8        11
50<n<=100               3         1        12
100<n                   1         7         3
Total                  33        33        33



In some cases important concepts are not translated, which tends to ruin the whole
query. The problem is in most cases related to the dictionaries used:

- if the word is not in the dictionary, it is used as such in the query

- compound words have constituents that are not translated, and due to this the
translated phrases come to include words in the source language which never
appear together with the translated ones in the document text. For example, the
Swedish word brandbekämpningsolyckor (fire-fighter casualties) is translated as
#4(fire bekämpning accident).

Language Specific Problems: 

Swedish: The morphological analyzer needs to be tuned for the normalization of 
constituents when splitting compounds. The algorithm we used for handling 
fogemorphemes appears to work well in the query formulation process and reduces the 
number of non-translated words in several topics. However, since we deal with
constituents of compounds, the actual effect on the search result also depends on other
factors, such as the extent to which the constituents bear important search keys.

German: The German language has the special feature of a capital initial letter in nouns,
and also the double “s” (ß) in text. We utilized morphological information on nouns in
German in order to match German noun keys more precisely to translation
dictionary entries. The capital initial letter was identified in all the input files: CLEF 
topic file, German stop word file and the Duden German-English translation table for 
the 33 CLEF topics. When splitting the compounds the noun constituents also had to 
get the capital initial letter in order to be translated. Fogemorphemes in German were 
treated in a similar way as in the Swedish process. In this case we only identified one 
of the most common fogemorphemes. 

Finnish: The Finnish language is special in having a very rich inflectional
morphology while lacking prepositions. The morphological analyzer works well
and the normalization process meets no major obstacles. Most problems are caused by
inflectional forms of proper names. These typically cannot be normalized since the 
morphological analysis program cannot identify them. 



4.2 Test Runs 

The results of the four official test runs (Finnish structured, Swedish structured, 
German structured and German unstructured) and the two unofficial runs (Finnish 
unstructured and Swedish unstructured) show comparable performance for three 
separate source languages (Fig. 3). The best average performance is by the German 
structured run, and the lowest by the Finnish unstructured. The average precision 
figures over recall levels are as follows (Table 2). 
Fig. 3. Interpolated recall-precision averages

Table 2. Interpolated recall - precision averages

recall level   Swestr   Sweuns   Finstr   Finuns   Gerstr   Geruns
0,0            0,6007   0,5666   0,5827   0,4128   0,6752   0,5492
0,1            0,4566   0,4314   0,4625   0,3111   0,5262   0,4473
0,2            0,4021   0,3581   0,4344   0,2855   0,4287   0,3728
0,3            0,3178   0,2587   0,3542   0,2343   0,3340   0,2837
0,4            0,2743   0,2259   0,2610   0,1990   0,2761   0,2318
0,5            0,2480   0,2044   0,2146   0,1762   0,2596   0,2152
0,6            0,1985   0,1740   0,1472   0,1066   0,1901   0,1582
0,7            0,1752   0,1415   0,1012   0,0560   0,1556   0,1191
0,8            0,1441   0,1128   0,0655   0,0349   0,1270   0,0887
0,9            0,1012   0,0793   0,0419   0,0196   0,0952   0,0593
1,0            0,0740   0,0565   0,0229   0,0072   0,0727   0,0418
Average        0,2540   0,2190   0,2275   0,1586   0,2665   0,2164


Structured - unstructured queries

We tested structured / unstructured query performance for all the language pairs.
The German - English comparison is based on official runs, while the Swedish -
English and Finnish - English comparisons involve unofficial runs. The results
indicate better performance for the structured queries. Our
earlier findings [7] with Finnish - English CLIR suggest that the difference in
performance for this language pair is larger. The unofficial runs show a better
performance also in this case for the Finnish structured queries compared to the
unstructured (by 7 percentage points on average). For Swedish - English structured /
unstructured queries the difference is about the same as for German - English (3 - 5
percentage points on average). One of the reasons may be the size of the dictionaries.
The smaller the dictionary, the more common one-to-one relations between source and
target words are, and the closer syn-based structured queries are to unstructured
queries. Query length in Swedish - English is shorter, which might be explained by the
smaller size of the Swedish - English dictionary. This does not, however, explain the
difference in performance between the Finnish - English and the German - English
unstructured queries compared to the respective structured queries.

Individual query performance 

Examining individual query performance of our official runs for each of the 33 topics,
we find that our results in general tend to be above the median value for all the
participating runs. On the other hand, we can report very good results for some topics
and complete failures for others, the variation being quite large (Fig. 4). This is
true for all the language pairs. A common feature of all the extreme cases is that, in
the positive case, the run succeeded in translating important concepts such as proper
names or, in the negative case, failed in this. Query number 12 (all languages) and 19
(Swedish) failed because of a wrong translation of the names Order of the Solar Temple
(12) and Persian Gulf syndrome (19). The Finnish query (number 30) failed because of the
lack of a translation for Nice, while the Swedish and German structured queries got the
best possible performance, although this was an extremely long query in German. The
Finnish query (number 37) also failed because of a proper name, Estonian, in inflected
form.



Fig. 4. Histograms comparing the Finnish run, the German structured run and the
Swedish run to the median value for all groups

Document cut-off values

The average precision at different document cut-off values for our test runs shows an
extremely similar performance for five of our six runs (Table 3). The German
unstructured (Geruns) run has a lower precision in the beginning (up to 15 retrieved
documents), but the three best runs, German structured (Gerstr), Swedish structured
(Swestr), and Finnish structured (Finstr), are very close for the whole range. The
difference between the Swedish structured and the Swedish unstructured run is very
small. The Finnish unstructured run differs clearly from the other runs and has a much
lower performance.




Fig. 5. Average precision at document cut-off values

Table 3. Average precision at some document cut-off values 5 - 1000

Precision at 5, 10, 15, ..., 1000 docs retrieved

Docs    Swe      Sweuns   Fin      Finuns   Gerstr   Geruns
5       0,3030   0,2848   0,3212   0,2364   0,3394   0,2545
10      0,2667   0,2455   0,2727   0,2030   0,2758   0,2515
15      0,2242   0,2182   0,2323   0,1798   0,2283   0,2202
20      0,2045   0,2061   0,2030   0,1591   0,2000   0,2030
30      0,1747   0,1778   0,1697   0,1303   0,1687   0,1667
100     0,0921   0,0921   0,0827   0,0676   0,0867   0,0824
200     0,0526   0,0556   0,0477   0,0395   0,0526   0,0492
500     0,0241   0,0259   0,0226   0,0193   0,0245   0,0230
1000    0,0128   0,0139   0,0127   0,0109   0,0131   0,0125
Exact   0,2664   0,2368   0,2452   0,1842   0,2793   0,2242
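
Precision at a document cut-off value is simple to compute. The sketch below assumes the common convention of dividing by the cut-off k itself, even when fewer than k documents were retrieved, and averaging over queries; the function names and the toy data are hypothetical.

```python
def precision_at(k, ranking, relevant):
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def mean_precision_at(k, runs):
    """Average precision at cut-off k over (ranking, relevant set) pairs."""
    return sum(precision_at(k, ranking, relevant) for ranking, relevant in runs) / len(runs)


# Toy results for two queries, for illustration only.
runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"}),
    (["d9", "d8", "d7", "d6", "d5"], {"d8"}),
]
print(precision_at(5, *runs[0]))
print(mean_precision_at(5, runs))
```

Since the number of relevant documents per topic is small compared with cut-offs such as 500 or 1000, the values necessarily become very low at the deep end of the table.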
5 Conclusions

We participated in the bilingual CLEF track with four official runs and two unofficial
additional runs, using three different source languages. The first research question we
raised, namely by which process, using bilingual dictionaries, we can automatically
construct effective target language queries, is answered in this paper. The processes we
developed and implemented, focusing on the proper handling of compound words and
inflectional morphology, worked to our satisfaction. We have analyzed quite a few
problems encountered in the query construction process, and can also contribute
some solutions to them for next year's CLEF conference.

The second research question was about the variations in retrieval effectiveness 
depending on the source language used. Our CLEF results show very similar retrieval 
performances for all the three source languages, yet we have discovered and in the 
paper analyzed differences in the query construction process. Analyzing single queries 
we discover differences between the languages, but since the results for the runs are 
average figures for all 33 requests the differences fade out. 

The same thing can be said about structured queries compared to unstructured:
individual queries show large differences (in either direction), while the average effect
is much smaller. Nevertheless, structured queries were better on average for all
language pairs. In the official runs the effectiveness for the German structured is better 
on a document cut-off value 15-20, which is the most important region from the user 
point of view. The effect of query structuring on the Finnish - English language pair 
seems to differ from the other two. The Finnish unstructured queries are clearly less 
effective in retrieving relevant documents. 
Abstract. This paper describes our participation in the CLEF bilingual retrieval
task (formulating queries in Spanish to retrieve documents in English), using an 
information retrieval (IR) system based on the vector model. Our aim was to 
use a simple approach to solve the problem, without expecting to obtain great 
results, especially owing to the short time available. The queries formulated in 
Spanish were translated to English by a commercial machine translation system. 
The translations were filtered to eliminate stop words, and then the remaining 
terms were stemmed using a standard stemmer. Results were poorer than those 
obtained through monolingual retrieval with original English queries, the 
difference being slightly over 15%. 



1 Introduction 

This study describes the participation of our team in the Cross-Language Evaluation 
Forum (CLEF-2000), as a first approach to bilingual information retrieval. Our main 
objective in participating in CLEF was to gain experience in the task of bilingual 
information retrieval with Spanish and English, although we have greater experience 
in monolingual information retrieval in Spanish. Our participation in CLEF 2000 
focussed on bilingual retrieval, using queries in Spanish with a collection of 
documents in English. Obviously, we also worked with the same queries, formulated 
originally in English, in order to establish a base-line for comparison of results. 

The IR problem when more than one language is involved, i.e. evaluating the 
similarity of a document written in a given language versus a query in another one, is 
that of achieving homogeneous representations of both elements (document and 
query) which may be compared in order to establish a degree of similarity between 
them [6]. Once this homogeneous representation has been achieved, the similarity 
between a query and each of the documents in the collection can be computed by any 
of the systems usually used for monolingual retrieval [5]. In our case we use the well- 
known vector model. 



C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 224-229, 2001. 
2 Approach to the Problem 

For term-based IR techniques, as is the case of the vector model, the terms 
represented in the documents and in the queries have to be put into the same 
language. In one way or another, in bilingual text retrieval this entails some type of 
translation, and finding a good translation system can solve the problem. 

In principle, it is a matter of translating individual terms, which does not seem to 
be as complicated as translating a syntactically structured text. However, the main 
problem, apart from the use of a machine-readable bilingual dictionary, lies in the 
disambiguation of the terms: these may have diverse meanings and each meaning may 
have diverse equivalents in the other language. It is not easy to determine the 
appropriate equivalents in each case and various methods have been proposed for this 
purpose [1]. The final result depends on the quantity and quality of the semantic 
knowledge contained in the dictionaries and word lists used. 

Thus, we shall not use the approach of translating terms, since this would lead to 
poorer results in retrieval. Translating systems find it easier to disambiguate and 
contextualize phrases [3], and this should give rise to better results. 

Hence, and because computationally it is simpler, the process followed was that of 
translating the queries to the language of the documents, and not the reverse. In our 
case, a very simple approach was adopted to solve the problem: that of using one of 
the commercial machine translation programs available. We did not expect great 
results, although it has allowed us a better understanding of the problem. 



2.1 Machine Translation 

Although machine translation (MT) is an area of intense research, there are already 
quite a few commercial programs on the market. These programs do not have much 
prestige, owing to the fact that the translations obtained often contain many mistakes 
and are sometimes linguistically unacceptable. However, we noted that the linguistic 
requirements of vector model based IR systems are not so great as those of the people 
who have to read and understand translations [4]. Indeed, many IR systems do not 
examine syntactical constructions and, when the terms are submitted to a stemming 
process, they disregard morphology. 

The use of one of these commercial MT systems does not present any difficulties. 
In our case, as we lack experience in bilingual retrieval, it seemed to be a good way to 
become introduced to the subject. This was our approach to the problem. 

Many MT systems also allow some kind of adaptation to the context, such as 
domain specific dictionaries, database for language pair translations, etc., which give 
better results in translation and, consequently, better results in retrieval. However, in 
our research, none of these additional tools was used. The simplest strategy was 
followed. 



3 The Experiment 

The layout of the process followed can be seen in the diagram below: 
Fig. 1. Spanish-English Bilingual IR system 



3.1 Queries in Spanish 

We should point out that the queries were not pre-processed, i.e. they were not treated 
to eliminate terms that might introduce mistakes in subsequent retrieval. Three 
translation programs were applied directly to the queries in Spanish, without 
considering the noise that those terms not relevant to the query might introduce into 
the system. 

A future study will be carried out to find out how errors in the translation of the 
most significant terms in the queries affect information retrieval. We expect to find 
parallelism between the errors in translation of the queries and the retrieval results. 



3.2 Translation of Queries 

Three MT programs were used: Systran (on-line version, http://www.systransoft.com), 
Globalink Spanish Assistant v1.0 and Globalink Power Translator Pro v6.2 (at present 
the last two are products of Lernout & Hauspie). These programs are not expensive 
(the Systran on-line version is free), and can be used on a PC with few resources. 

The reason for using three programs was to check the quality of the translation, 
and, consequently, to use the best of the three translations for retrieval. In no case 
were thematic or contextual dictionaries used. We used the complete topic set in 
Spanish, i.e. titles, descriptions and narratives, and input it to each of the three 
translation systems. 

The three systems tested produced very similar translations, and also coincided, 
notably, in the same errors. A study of the errors made by each gave very similar 
figures for all three. This study was carried out taking into account the significant 
terms for the retrieval of the original queries in English, contrasted with significant 
terms of the translations. The different terms were considered as translation errors, 
except in the cases of evident synonyms. One error was counted in those cases in 
which Spanish-English translation produced two or more terms, when in the English 
queries there was only one. Although this type of count is not very rigorous, it at least 
allows us to explore the possible differences between the three translation systems 
tested, from the point of view of information retrieval. 

The error percentages thus estimated were very similar for all three. The 
differences were very small, with the results obtained by Systran being slightly more 
favorable. Moreover, and more intuitively, the mere reading of the translations 
showed that Systran seems to work better with proper nouns. It is better at detecting 
whether a word is a proper noun, and, when that name can be translated, it also 
translates it better. Thus, we opted to work with Systran. 



3.3 Translated Questions 

The translations obtained in the previous phase were processed following the normal 
retrieval process of the vector model: elimination of stop words, stemming and 
calculation of weight. 

The original queries in English underwent the same treatment. A comparison was 
made of the stems obtained for the queries translated and those obtained for the 
original queries in English. A discrepancy of around 28% was observed, i.e. over a 
quarter of the stems of the queries translated into English were different from the 
stems of the original questions in English. This does not necessarily mean that the 
stems obtained were incorrect, since in some cases the translations may have used 
synonyms, or semantically equivalent terms. 
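
The comparison of stems can be illustrated with a small sketch. The paper does not say exactly how the discrepancy was counted, so the set-based definition below and the example stems are assumptions made only for illustration.

```python
def stem_discrepancy(translated_stems, original_stems):
    """Share of distinct stems in the translated query that are absent
    from the original query (an assumed, set-based definition)."""
    translated = set(translated_stems)
    if not translated:
        return 0.0
    missing = translated - set(original_stems)
    return len(missing) / len(translated)


# Made-up stems: one of the four translated stems differs from the original.
example = stem_discrepancy(
    ["alcohol", "consumpt", "europ", "reason"],
    ["alcohol", "us", "europ", "reason"],
)
print(example)
```

Under this definition a synonym counts as a discrepancy even when it is a perfectly good translation, which is the caveat made in the text.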



3.4 IR System 

As a retrieval engine we used our own software, which we have called Karpanta¹ [2]. 
This is a simple program based on the vector model, which was designed mainly for 
educational and not operational purposes. Owing to the large number of documents 
used in the experiment (113,000 documents, 400 MB of information) the operation 
process was frustratingly slow. This did not worry us at first, since the objective of 
our study was to verify the use of a simple approach to the problem: the application of 
an inexpensive MT system to CLIR. 

Before indexing the documents in English, stop words were eliminated in order to 
save index space. For this purpose a standard list of some 200 components was used. 
Remaining words were stemmed by applying Porter’s algorithm [PORTER80]. We 
used a Perl script with an implementation of this algorithm, which is widely diffused 
through CPAN [7]. Karpanta was then used to index all the documents in English, 
with all their fields. The weights of the stems obtained were calculated with the usual 
scheme of frequency of term in the document by IDF. 

The queries translated into English were processed in the same way. They were 
used as a whole, with title, description and narrative; stop words were eliminated and 
stems obtained whose weight was calculated in the same way. The solving of the 
queries, i.e. the computation of similarity between each query and each of the 
documents, was performed using the widely known cosine formula. 

The same process was also followed for the original queries in English, thus 
obtaining results, which have served as a reference point to establish comparisons 
with the results obtained after bilingual retrieval. 



¹ A legendary figure in Spanish comics, whose most outstanding characteristic was that of 
always being hungry. 
It should be emphasized that in no case was relevance feedback used in our 
experiments, despite the fact that this would probably have given rise to much better 
results. 



4 Results 

The results obtained with the queries translated from Spanish gave a mean precision 
of 0.2273 and can be seen in the attached graph. However, the results varied considerably 
across queries (standard deviation = 0.23). 

If we compare these results with those obtained using the original queries in 
English (mean precision of 0.27), the former are slightly lower. The precision-recall 
curves are almost parallel. 
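
As a quick check, the relative loss can be recomputed from the two mean precision values reported above. The variable names are ours.

```python
monolingual = 0.27   # mean precision, original English queries (as reported)
bilingual = 0.2273   # mean precision, queries translated from Spanish (as reported)

# Loss of the bilingual run relative to the monolingual baseline.
relative_drop = (monolingual - bilingual) / monolingual
print(f"{relative_drop:.1%}")
```

The result is a little under 16%, consistent with the "slightly over 15%" quoted in the abstract.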




Fig. 2. Spanish-English Bilingual Retrieval Comparison 



Moreover, if we observe each individual query, it can be seen that there are many 
parallels: the queries translated into English which give the best results coincide with 
the original queries in English that work best. Those with the worst results also show 
the same parallelism, both for the original queries in English and those translated from 
Spanish. 



5 Conclusions 

The use of a commercial MT system to solve bilingual retrieval tasks is an easy and 
swift solution, although effectiveness in retrieval is slightly below that obtained in 
monolingual results. The difference is around 15%, although this figure is less at low 
recall levels, i.e. taking into consideration only the first documents retrieved. 
Fig. 3. Difference in mean average precision between the original English query set and 
the translated one, by query number. 



No relevance feedback of queries was performed in our experiments, although this 
would probably have led to much better results. 

Future work will be done to find out how translation errors of significant terms for 
the retrieval affect the results. 
Abstract. This paper describes an elementary bilingual information retrieval 
experiment. The experiment takes Dutch topics to retrieve relevant English 
documents using Microsoft SQL Server version 7.0. In order to cross the 
language barrier between query and document, the researchers use query 
translation by means of a machine-readable dictionary. The Dutch run was void 
of the typical natural language processing techniques such as parsing, 
stemming, or part of speech tagging. A monolingual run was carried out for 
comparison purposes. Due to limitations in time, retrieval system, translation 
method, and test collection, there is only a preliminary analysis of the results. 



1 Introduction and Problem Description 

Cross-Language Information Retrieval (CLIR) systems enable users to formulate 
queries in their native language to retrieve documents in foreign languages [1]. In 
CLIR, retrieval is not restricted to the query language. Rather queries in one language 
are used to retrieve documents in multiple languages. Because queries and documents 
in CLIR do not necessarily share the same language, translation is needed before 
matching can take place. This translation step tends to cause a reduction in cross- 
language retrieval performance as compared to monolingual information retrieval. 
The literature explores four different translation options: translating queries (e.g. [2], 
[3]), translating documents [4], [5], translating both queries and documents [6], and 
cognate matching¹ [7]. The prevailing CLIR approach is query translation. 

The translation of queries is inherently difficult due to the lack of a one-to-one 
mapping of a lexical item and its meaning. This creates lexical ambiguity. Further, 
query translation is complicated by the cultural differences between language 
communities and the way they lexicalize the world around them. These two 
translation issues create many different translation problems such as lexical 
ambiguity, lexical mismatches, and lexical holes. In turn, these and other translation 
problems result in translation errors which impact CLIR retrieval performance. 



¹ Cognate matching facilitates matching cognates (words that have identical spelling) across 
languages by allowing for minor spelling differences between the cognates. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 230-236, 2001. 
The Cross-Language Evaluation Forum (CLEF) provides a multilingual test 
collection to study CLIR using European languages. One of the CLEF tasks is 
bilingual information retrieval. The aim of the bilingual task is the retrieval of 
documents in a language different from the topic (query) language. Unlike the 
multilingual task, only two languages are involved and retrieval results are 
monolingual. For the bilingual run we used the Dutch topic set (40 topics) to retrieve 
English documents (Los Angeles Times of 1994 - 113,005 documents, 409,600 KB). 
We were completely oblivious to CLEF and its deadlines but we happened to hear 
that CLEF results were due in one week. We immediately signed up and started on 
our mad rush to get results in on time. 



2 Experimental Setup 

In monolingual information retrieval experiments, researchers commonly vary the 
information retrieval system while keeping the test queries and documents constant. 
This allows for comparison between systems and comparison between different 
versions of the same system. The same practice is followed in CLIR experiments 
when comparing different systems. However, CLIR experiments vary the test queries 
rather than the system, to allow for comparison between the cross-language and 
monolingual capabilities of the same system. The experiments in this research rely on 
varying the test queries. 

By manually translating test queries into a foreign language and using these test 
queries as the cross-language equivalents, the cross-language performance of a system 
can be compared directly to its monolingual performance (see figure 1). Manual 
translation of queries is now a widely used evaluation strategy because it permits 
existing test collections to be inexpensively extended to any language pair for which 
translation resources are available. The disadvantage of this evaluation technique is 
that manual translation requires the application of human judgment, and evaluation 
collections constructed this way exhibit some variability based on the terminology 
chosen by a particular translator. 

The CLEF experiments described in this paper are modeled after the experiments 
described above. CLEF provided topic sets in both languages. Of these, we used only 
the descriptions and narratives. The English topics were POS-tagged to aid phrase 
detection and stopwords were filtered out using the SMART stop list. We wrote a 
crude Perl program to convert the English query into a Boolean representation that 
was usable by the retrieval system (described in System Overview). The Dutch 
topics were processed differently since we lacked Dutch text processing resources. 
For each query, we extracted individual tokens, treating each token separated by 
spaces as a single word. A dictionary lookup took place for each token and all 
possible translations with their parts of speech (nouns, adjectives, verbs, and adverbs 
only) were inserted into the query translation file. Words that lacked a translation 
were left untranslated. The translation file was converted into a logical representation. 
Translation synonyms were combined using the OR operator and phrases were added 
using double quotes around the phrase. We assumed that capitalized translated tokens 
were important to the query and used the AND operator to add them to the logical 
representation (see table 1). 

Fig. 1. Bilingual CLIR system evaluation. 



Original topic 

<top> 

<num> C034 
<D-title> 

Alcoholgebruik in Europa 
<D-desc> 

Omvang van en redenen voor het gebruik van alcohol in Europa. 
<D-narr> 

Behalve algemene informatie over het gebruik van alcohol in 
Europa is ook - maar niet uitsluitend - informatie over 
alcoholmisbruik van belang. 

</top> 

Logical representation after translation (based on description 
and narrative) 

("Europe") AND ("alcoholgebruik" OR "dimension" OR 
"application" OR "alcohol" OR "general" OR "data" OR 
"exclusively" OR "advantage") 



Table 1. Query processing. 
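
The token-by-token translation into a logical representation can be sketched as follows. This is not the program used in the experiment: the miniature dictionary is hypothetical, mapping function words to an empty list is our way of modelling words with no noun, adjective, verb, or adverb translation, and joining several capitalized terms with AND is an assumption.

```python
# Hypothetical miniature Dutch-English dictionary, for illustration only.
DICTIONARY = {
    "omvang": ["dimension"],
    "gebruik": ["application"],
    "alcohol": ["alcohol"],
    "europa": ["Europe"],
    "van": [],
    "het": [],
    "in": [],
}


def build_boolean_query(dutch_text, dictionary):
    """Translate space-separated tokens and build a Boolean query string."""
    required, optional = [], []
    for token in dutch_text.split():
        word = token.strip(".,;:-()")
        if not word:
            continue
        translations = dictionary.get(word.lower())
        if translations is None:
            translations = [word]  # no dictionary entry: leave untranslated
        for term in translations:
            # Capitalized translations are treated as required (AND) terms.
            bucket = required if term[:1].isupper() else optional
            if term not in bucket:
                bucket.append(term)
    groups = []
    if required:
        groups.append("(" + " AND ".join(f'"{t}"' for t in required) + ")")
    if optional:
        groups.append("(" + " OR ".join(f'"{t}"' for t in optional) + ")")
    return " AND ".join(groups)


query = build_boolean_query(
    "Omvang van het gebruik van alcohol in Europa.", DICTIONARY
)
print(query)
```

Every content word contributes at least one OR term, so the representation grows quickly with the length of the topic text.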

Unfortunately our plain and simple approach was thwarted by the retrieval system 
which stumbled on our rather lengthy query representations. Since we only had hours 
to spare before we had to submit our results, we decided to drastically shorten our 
Dutch queries. The translations we used were grouped by part-of-speech so we 
decided to pick only those translations listed under the very first part-of-speech. The 
queries were still too long so we further limited the translation to the first term within 
that part-of-speech (excluding all synonyms). Looking back, we should probably have 
limited our queries to the title fields rather than using the lengthy description and 
narrative but we ran out of time. It is not surprising that our results were a bit dismal 
(see Results). 



3 System Overview 

The system used in the experiments utilized the full-text support of Microsoft SQL 
Server version 7.0 [8]. SQL Server is a commercial relational database system. 
Besides regular relational operations, in version 7.0, it introduces facilities that allow 
full text indexing and searching of textual data residing in the server. Full-text search 
on database data is enabled by proprietary extensions to the SQL language. The 
following search methods are available in SQL Server 7.0: 

• search on words or phrases 

• search based on prefix of a word or phrase 

• search based on word or phrase proximity 

• search based on inflectional form of verb or adjective 

• search based on weight assigned to a set of words or phrases 

However, we only used the phrase and word or phrase proximity search functions 
in the experiments described in this paper. The system requires documents in the 
collection to be exported to the database before any indexing and searching can take 
place. Therefore, a table was created in SQL Server to represent the whole collection 
and each document in the collection was converted to a record in the table. The table 
was comprised of two columns: DOCNO and DOCTEXT. DOCNO served as the 
unique identification of each record in the table. DOCTEXT stored the text content of 
the documents. In the TREC collection, all documents are marked up in Standard 
Generalized Markup Language (SGML) format. Elements like DOCNO, TITLE, 
AUTHOR, and TEXT for example, are used to mark up text segments and to indicate 
the semantics of that portion of text. Among those elements, text content of each 
document’s DOCNO element and the TEXT element was extracted and written into 
the table’s DOCNO and DOCTEXT columns respectively. Any SGML tags inside the 
TEXT elements were stripped out before the actual export took place. After the table 
was populated with textual data from the collection, a full-text index was created 
based on the table’s DOCTEXT column. 
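
The export step can be sketched as follows: the DOCNO and TEXT elements are extracted and any tags inside TEXT are stripped, giving one (DOCNO, DOCTEXT) pair per document. The enclosing `<DOC>` element, the regular expressions, and the sample document are assumptions, and loading the pairs into the database table is not shown.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)      # assumed wrapper element
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def sgml_to_records(sgml):
    """Return (DOCNO, DOCTEXT) pairs, with tags inside TEXT stripped."""
    records = []
    for body in DOC_RE.findall(sgml):
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without an identifier
        text = " ".join(TEXT_RE.findall(body))
        text = " ".join(TAG_RE.sub(" ", text).split())  # drop tags, tidy spaces
        records.append((docno.group(1).strip(), text))
    return records


# Invented sample document in the assumed layout.
sample = """<DOC>
<DOCNO> LA010194-0001 </DOCNO>
<TITLE>Example</TITLE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""
records = sgml_to_records(sample)
print(records)
```

Only the TEXT content reaches the DOCTEXT column here, so words that occur only in the TITLE or AUTHOR elements cannot be matched by a query.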

After a query was sent to the system, a result set of document number, DOCNO, 
along with rank was returned. The rank was a value between 0 and 1000 which was 
generated by SQL Server to indicate how well a record matched the query. The 
results of each query were sorted by the system specific rank value in descending 
order and the 1,000 highest-ranking records were collected to generate the result 
submission file. For numerous queries the system retrieved fewer than 100 documents 
and in some cases nearly no documents at all. 
4 Results 

As pointed out previously, our results were disappointing. Out of the 33 topics that 
had relevant documents, the Dutch-English cross-lingual run only retrieved relevant 
documents for approximately 70% (23) of them. The English monolingual run did 
slightly better retrieving relevant documents for approximately 76% (25). We believe 
that the low number of relevant documents for a large number of topics in the test 
collection has affected the average precision measure (see Analysis) and therefore 
report the following numbers with some reservation. Average precision is 0.0364 for 
our cross-lingual run and 0.0678 for our monolingual run. A recall-precision table 
will not be presented since we would have to change the scale to make it show 
anything meaningful. As well, the graph will not provide a fair representation. Our 
Boolean system failed to retrieve the full 1000 documents for a large number of 
queries (we retrieved a total of 24,571 documents out of a possible 33,000 for cross- 
lingual and 15,057 out of 33,000 for monolingual). 

In an effort to determine whether the problems we encountered were system based, 
we ran the identical set of queries on the Mirror DBMS system. The Mirror DBMS 
system combines information retrieval and data retrieval and uses statistical language 
models for information retrieval [9, 10]. The results improved drastically. For the 
cross-lingual run average precision improved by about 228% (new average precision 
0.1197). For the monolingual run average precision improved by about 435% (new 
average precision 0.3630) (see figure 2). Interestingly, the monolingual results had a 
much larger improvement. 
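
The reported improvements can be recomputed from the average precision values given above. The helper name is ours.

```python
def relative_improvement(old, new):
    """Improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


cross_lingual = relative_improvement(0.0364, 0.1197)
monolingual = relative_improvement(0.0678, 0.3630)
print(f"cross-lingual: {cross_lingual:.1%}, monolingual: {monolingual:.1%}")
```

Both values agree with the rounded percentages quoted in the text to within one percentage point.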



Fig. 2. Interpolated recall-precision using the Mirror DBMS 
5 Analysis 

The original results cannot just be blamed on the fact that most of the translations had 
to be removed to reduce the length of the queries (see Experimental Setup). Clearly, 
our monolingual results are also disappointing. We speculate that the lack of 
sophisticated linguistic processing, and techniques such as query expansion are 
reasons for our disappointing results. It is important to realize that the main reason for 
these results is the unsatisfactory retrieval capability of the commercial relational 
database used in the initial experiments. Additional experiments using the Mirror 
DBMS system show enormous performance improvements. 

There are, however, issues regarding the test collection used in these experiments 
that impact the evaluation of the results. Many of the topics only have a very limited 
number of relevant documents. Out of 40 topics, 7 topics do not have any relevant 
documents and these topics were left out of the analysis. This left 33 topics. Of these 
33 topics, 33% (11 topics) have fewer than 10 relevant documents, 
and 18% of the 33 topics have 5 or fewer relevant documents. The lack of 
relevant documents is problematic for measures such as average precision because 
averages are sensitive to large differences between numbers [11]. Topics 4 and 30, for 
example, only have 1 relevant document each. If this document is retrieved on rank 1 
precision is 1 but if it is retrieved at rank 2 precision drops to 0.5. Average precision 
is also very sensitive to queries that perform poorly and these are represented in 
greater abundance in CLIR where extra noise is added in the translation. To soften the 
impact of bad queries, a test collection should provide a larger number of topics to 
reduce the effect these queries might have. 33 topics alone might not be enough. 
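
The sensitivity described above is easy to reproduce with a small sketch of non-interpolated average precision for a single topic. The function is a generic textbook formulation, not the evaluation software used in the campaign.

```python
def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values at the ranks where relevant documents occur.

    ranked_relevance: booleans in rank order (True = relevant document).
    total_relevant:   number of relevant documents for the topic.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


# A topic with a single relevant document, retrieved at rank 1 or at rank 2.
ap_rank1 = average_precision([True, False, False], 1)
ap_rank2 = average_precision([False, True, False], 1)
print(ap_rank1, ap_rank2)
```

Moving the only relevant document down by one rank halves the score for that topic, and with only 33 topics such swings weigh heavily on the mean.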

The shortage of relevant documents also affects precision (X) measures. Hull [12] 
suggests using high precision measures for cross-language system evaluation because 
they best reflect the nature of CLIR. In an ad hoc cross-lingual search, users are less 
likely to go through large numbers of documents to assess their relevance since they 
are not likely to be proficient in the language. It is important therefore to rank relevant 
documents at a high level. In addition, cross-lingual searches tend to benefit 
substantially from relevance feedback since this adds new foreign language 
terminology to the query that might be lacking in the original search. Here too it is 
important to rank relevant documents highly. Precision (10) is a good indicator of a 
system’s ability to rank relevant documents highly. The problem with this test 
collection is that for 33% of the topics, a system could never have a perfect precision 
(10) score even if a system managed to retrieve all the relevant documents in the top 
10. 
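
The ceiling on precision (10) for topics with few relevant documents can be made concrete with a short sketch. The function names and the four-document example are ours.

```python
def precision_at_k(ranked_relevance, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for r in ranked_relevance[:k] if r) / k


def best_possible_precision_at_k(total_relevant, k=10):
    """Upper bound on precision at k for a topic with few relevant documents."""
    return min(total_relevant, k) / k


# A topic with 4 relevant documents, all ranked at the very top.
ideal_run = [True] * 4 + [False] * 6
p10 = precision_at_k(ideal_run)
ceiling = best_possible_precision_at_k(4)
print(p10, ceiling)
```

Even this ideal ranking scores only 0.4, so precision (10) values for such topics are not comparable with those for topics that have ten or more relevant documents.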



6 Future Work 

After a more careful analysis of the results described in this paper we plan on carrying 
out system testing exploring the system features more carefully. We plan on 
examining the translation from the query to the logical representation and the 
incorporation of query expansion and automatic relevance feedback. 
Abstract. HyREX is the Hypermedia Retrieval Engine for XML. Its 
extensibility is based on the implementation of physical data indepen- 
dence; its query interface on the conceptual level consists of data types 
with respective vague search predicates. This concept enabled us to add 
search predicates for the data type text to do bilingual text retrieval. 
Our implementation uses free Internet resources for translating topics in 
English to German and vice versa. 



1 Introduction 

Typical Information Retrieval (IR) applications offer information to the user 
which consists of more than just plain text documents. Digital libraries for ex- 
ample do not only offer full texts of scientific publications but also metadata 
comprising bibliographic information as well as indexing information like e.g. 
subject descriptors or classification codes. Often markup languages like SGML 
or XML are used to expose the logical structure of documents on the one hand 
and the attribute structure of metadata on the other hand. 

This kind of fine grained markup of logical and attribute structure should be 
explored by IR systems in order to offer special search predicates for different 
types of data. For example, when searching for person names, as in an author attribute, 
similarity search for proper names should be offered. This comprises not only 
string search but especially the possibility to search for phonetically similar 
names. Accordingly, not only predicates for testing equality should be offered 
for dates but also predicates like greater than, less than, or vague predicates like 
around date. 

HyREX¹, the Hypermedia Retrieval Engine for XML, offers this kind of 
search predicate for different data types. Data types with their respective (vague) 
search predicates build the interface to the conceptual level and thus hide their 
implementation details on the physical (internal) level. The concept of data 
independence is further explained in Section 2. Instead of treating the different 
data types as being independent of each other, it is more appropriate to use an 
inheritance hierarchy. This kind of relationship on data types is used to integrate 
bilingual IR mechanisms into HyREX, as described later. Translation of queries is done 

¹ http://ls6-www.cs.uni-dortmund.de/ir/projects/hyrex/ 
using rather naive dictionary and machine translation methods. Experiments 
with HyREX and the CLEF 2000 collections and their respective 
results are described after that, followed by a conclusion and an outlook on further 
work. 



2 Data Independence in HyREX 

The general idea underlying the concept of data independence is the following: 
By introducing several abstraction levels for data organisation, changes at a 
certain level do not affect the higher levels. For example, if an index on the 
physical level is added for speeding up certain types of queries, this should not 
affect the search operations on the conceptual level, except that some of them 
can then be processed more efficiently. 

In the ANSI/X3/SPARC model [Tsichritzis & Klug 78], originating from 
the database field, three levels of data organisation are distinguished: 

— The physical (internal) level deals with internal data and record formats and 
access structures. 

— On the conceptual level, the complete conceptual schema of the database is 
visible. However, physical data independence guarantees that any changes 
on the internal level do not affect any application addressing the conceptual 
level. 

— The external level provides specific views of the database by referring only 
to those relations and attributes that are needed by a specific application. 

When we designed HyREX, we adopted these concepts for data independence 
from the database field. HyREX deals with the physical level, that is access 
paths for efficient query processing are provided through a proper interface to 
the conceptual level. This leads to the following advantages: 

— Physical data independence: Search operations are independent from the 
availability of access paths. In many retrieval applications one can observe 
that this is not the case. Most systems only allow for queries which can be 
directly answered from an existing inverted file. In HyREX physical data 
independence is reached by different levels of index support. They range 
from scanning (no index available; queries are processed by directly scanning 
through the documents) over support structure to direct index (for example 
an inverted file for term searches). 

— Appropriate search operations are provided. For example, in most retrieval 
systems noun-phrase search is based on proximity operators. Here, the user 
has to decide on the criteria which make up a phrase (e.g. distance and ordering 
of constituents of a phrase in the text of the documents). The philosophy of 
HyREX in this case would be to hide such implementation details from the 
user. The user is provided with a specific search predicate for phrases, while 
the system internally decides how a phrase is defined. Of course HyREX’s 
decision might be based on criteria like distance and ordering but also more 
enhanced methods can be implemented without affecting the user’s search 
interface. 

— Finally, the concept of data abstraction by means of different system levels 
helps to modularise the system. HyREX has an object-oriented design. 

In HyREX the search interface is made up of data types with vague pred- 
icates. The attribute structure of a given document base is therefore mapped 
onto a schema which assigns each attribute its respective data type: 

Schema := {AName_1 : Datatype_1; ... ; AName_n : Datatype_n} 

For example a simple schema for a literature database could look like the 
following: 

Schema := {Author : PersonName; Content : Text; PubDate : Date} 




Fig. 1. General UML class diagram [Fowler Scott 97] of a data type: a data
type aggregates one or more search predicates. These predicates are implemented
by and therefore relate to one or more access path structures.



A data type is made up of its domain (i.e. the values comprising the data type)
and appropriate (vague) search predicates, which can be applied to elements
from the data type’s domain (a more formal view on data types and search
predicates is given in [Fuhr 99]). Figure 1 shows the general UML class diagram
of a data type. The data type aggregates one or more search predicates. In the
search predicates we separate the conceptual from the internal level: while the
predicates make up the search interface at the conceptual level, their
implementation by means of appropriate access paths or scanning is hidden on the
physical level. Details of the implementation are given in [Fuhr et al. 98].

With the schema, which assigns to each attribute a data type with the respective
predicates, one can formulate queries at the conceptual level. Such queries
are basically triples consisting of an attribute name, a predicate, and a
comparison value. With respect to the schema above, for example, the following
queries can be issued:

— Author sounds-like Norbert Fuhr asks for documents being authored by 
someone whose name sounds similar to Norbert Fuhr. 

— PubDate around-year 1999 asks for documents which have been published 
around year 1999. 

— Content contains-phrase probabilistic IR asks for documents dealing with 
the concept probabilistic IR. 
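
The idea of a schema whose data types carry vague predicates, queried through (attribute, predicate, value) triples, can be sketched as follows. This is only an illustration and not the HyREX implementation: the names DataType, around_year, contains_phrase and evaluate, as well as the scoring functions themselves, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A vague predicate maps (attribute value, comparison value) to a score in [0, 1].
Predicate = Callable[[object, object], float]


@dataclass(frozen=True)
class DataType:
    """Hypothetical data type: a name plus the search predicates it offers."""
    name: str
    predicates: Dict[str, Predicate]


def around_year(value: int, target: int) -> float:
    """Toy vague predicate: 1.0 for the exact year, falling linearly to 0 at +/- 5 years."""
    return max(0.0, 1.0 - abs(value - target) / 5.0)


def contains_phrase(text: str, phrase: str) -> float:
    """Toy phrase predicate; how a phrase is matched stays hidden from the caller."""
    return 1.0 if phrase.lower() in text.lower() else 0.0


DATE = DataType("Date", {"around-year": around_year})
TEXT = DataType("Text", {"contains-phrase": contains_phrase})

# Schema := {PubDate : Date; Content : Text}
SCHEMA = {"PubDate": DATE, "Content": TEXT}


def evaluate(document: dict, attribute: str, predicate: str, value) -> float:
    """Evaluate one query triple (attribute, predicate, comparison value) on a document."""
    datatype = SCHEMA[attribute]
    return datatype.predicates[predicate](document[attribute], value)


doc = {"PubDate": 1998, "Content": "A survey of probabilistic IR models"}
print(evaluate(doc, "PubDate", "around-year", 1999))
print(evaluate(doc, "Content", "contains-phrase", "probabilistic IR"))
```

Because the caller only names a predicate, the matching logic behind it can be replaced without touching the query.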

Instead of treating the different data types as being independent of each
other, it is more appropriate to use an inheritance hierarchy, i.e. data types can
inherit from each other. A data type D' which is a specialisation of a data
type D inherits all predicates of D and can be extended by more specific search
predicates. A simple inheritance hierarchy is depicted in Figure 2. For
example, while the data type Text::English inherits the predicates
equal, contains, and contains-phrase from its ancestors, it specialises the Text
data type by the language-dependent contains-normalised predicate, which
provides for searching for word stems.
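
In an object-oriented language this kind of specialisation maps directly onto class inheritance. The sketch below is an assumption-laden illustration, not HyREX code: the class names, the Boolean return values and the crude suffix stripper (a stand-in for a real English stemmer) are invented for the example.

```python
class Text:
    """Base data type: predicates shared by all text attributes."""

    def equal(self, value: str, query: str) -> bool:
        return value == query

    def contains(self, value: str, query: str) -> bool:
        return query.lower() in value.lower().split()

    def contains_phrase(self, value: str, query: str) -> bool:
        return query.lower() in value.lower()


class TextEnglish(Text):
    """Specialisation that inherits all Text predicates and adds a language-dependent one."""

    @staticmethod
    def _stem(word: str) -> str:
        # Crude suffix stripping, only a stand-in for a real English stemmer.
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def contains_normalised(self, value: str, query: str) -> bool:
        stems = {self._stem(w) for w in value.lower().split()}
        return self._stem(query.lower()) in stems
```

An instance of TextEnglish answers equal, contains and contains_phrase through inheritance, while contains_normalised is available only on the specialised type.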




Fig. 2. Inheritance hierarchy on data types.



3 Search Predicates for Bilingual Retrieval 

Having a system which is extensible with respect to data types and their
respective search predicates, we decided to extend the Text::English and
Text::German data types by search predicates for bilingual text retrieval. These
predicates had to perform the translation of topics and queries from German to
English in the case of data type Text::English and vice versa in the case of data
type Text::German.
For the translation of queries we adopted two rather naive, but fully automatic
approaches. In both approaches we used free Internet resources:

— Approach 1 uses the Babelfish translation service of Altavista. This service
translates passages in a source language into a given target language.
Besides the translation from German to English and vice versa, Babelfish
handles various other languages.

— Approach 2 uses an ordinary online dictionary for word-by-word translations.
We chose the Leo Dictionary service for this purpose. Leo provides an
English/German dictionary with about 223 900 entries. Translations can be
done in both directions. Since compound words and phrases are also included
in the dictionary, we exploited this by not translating the original topics
word-by-word but by interpreting each two neighbouring terms as a phrase.
Adopting a really naive approach, we did not even attempt to tackle the
word disambiguation problem.
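
The dictionary approach can be pictured roughly as below. The text does not spell out the exact lookup order, so this sketch assumes that each pair of neighbouring terms is first tried as a phrase and that terms not covered by a phrase are then translated individually, keeping every listed translation. The toy dictionary entries and the function name translate_topic are invented for illustration.

```python
from typing import Dict, List

# Toy German-to-English dictionary; the entries are invented for illustration.
TOY_DICT: Dict[str, List[str]] = {
    "saurer regen": ["acid rain"],
    "regen": ["rain"],
    "saurer": ["acid", "sour"],
    "schaden": ["damage", "harm"],
}


def translate_topic(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Translate source terms, trying neighbouring pairs as phrases first."""
    result: List[str] = []
    covered = set()  # indices of terms already translated as part of a phrase
    for i in range(len(terms) - 1):
        pair = f"{terms[i]} {terms[i + 1]}"
        if pair in dictionary:
            result.extend(dictionary[pair])
            covered.update((i, i + 1))
    for i, term in enumerate(terms):
        if i not in covered:
            # No disambiguation: every listed translation is kept;
            # terms missing from the dictionary are silently dropped.
            result.extend(dictionary.get(term, []))
    return result
```

Since ambiguous terms contribute all of their translations, the translated topic tends to be longer than the original one.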



Fig. 3. Bilingual search predicates, implemented using free Internet translation
services.



Figure 3 shows the general scheme of our search predicates for bilingual text
retrieval. The user gives the query in a source language, which is translated by
means of a translation wrapper. The task of the wrapper is to give a uniform
interface to free translation resources on the Internet: it accepts the query as
given by the user plus the source and target language, and then handles the
translation through the service it was implemented for.
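
A wrapper of this kind could look roughly like the following sketch. It is not the authors' code: the class and function names are hypothetical, and an in-memory lookup table stands in for the web services so that the example runs offline.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple


class TranslationWrapper(ABC):
    """Uniform interface: query text plus source and target language in, translation out."""

    @abstractmethod
    def translate(self, query: str, source: str, target: str) -> str:
        ...


class TableWrapper(TranslationWrapper):
    """Stand-in backend that uses an in-memory table instead of calling a web service."""

    def __init__(self, tables: Dict[Tuple[str, str], Dict[str, str]]):
        self._tables = tables

    def translate(self, query: str, source: str, target: str) -> str:
        table = self._tables[(source, target)]
        # Unknown words are passed through unchanged.
        return " ".join(table.get(word, word) for word in query.lower().split())


def bilingual_search(query: str, source: str, target: str,
                     wrapper: TranslationWrapper,
                     search: Callable[[str], List[str]]) -> List[str]:
    """Bilingual predicate: translate the query, then delegate to a monolingual search."""
    return search(wrapper.translate(query, source, target))
```

A second service would be supported by adding another subclass; the search predicate itself stays unchanged.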



Babelfish: http://babelfish.altavista.com/
Leo: http://dict.leo.org/
4 Experiments

In order to evaluate our search predicates for bilingual retrieval in terms of
effectiveness we used two document collections from the Cross-Language Evaluation
Forum (CLEF). Both the la_times collection and the domain-specific GIRT collection
come with topics in German and English; relevance judgements have been
derived by judging the results of the CLEF 2000 participants.

Both test collections have been indexed by HyREX; to build the proper access
paths for the bilingual search predicates we applied language-specific stop-word
removal and stemming to the documents’ content. The well-known tf × idf
scheme [Salton & Buckley 88] has been applied for term weighting.

For comparison we also performed monolingual retrieval runs on both col- 
lections. Effectiveness has been measured in terms of recall and precision. The 
results are presented by means of recall-precision curves and the average preci- 
sion w. r. t. 100 recall points. 



4.1 la_times 

The la_times collection consists of 84 347 newspaper articles in English from the
1994 volume of the Los Angeles Times.




Fig. 4. Effectiveness of bilingual retrieval with the la_times collection

CLEF: http://www.iei.pi.cnr.it/DELOS/CLEF/

Los Angeles Times: http://www.latimes.com/
For bilingual retrieval on the la_times collection, both the Babelfish and
the Leo approach have been used to translate the topics. Figure 4 shows the
recall-precision curves resulting from the bilingual (German to English) and
monolingual retrieval runs. The average precision is 11.23% for the bilingual
run using the Babelfish approach, 5.44% for the bilingual run using the Leo
approach, and 12.11% for the monolingual run.



4.2 GIRT 

The GIRT (German Indexing and Retrieval Test database) collection contains
76 128 documents from the social sciences domain. The documents are in German
and have been put together by IZ Bonn. Topics were given both in English
and German.




Fig. 5. Effectiveness of bilingual retrieval with the GIRT collection 



For bilingual retrieval on the GIRT collection, the English topics have been
translated by the Babelfish approach. Figure 5 shows the recall-precision curves
resulting from the bilingual (English to German) and monolingual retrieval runs.
The average precision is 4.20% for the bilingual run and 15.78% for the
monolingual run.



IZ Bonn: http://www.bonn.iz-soz.de/
4.3 Analysis

The results show that bilingual retrieval implemented through free Internet
translation services can be employed in domain-unspecific information retrieval
applications. With the Babelfish approach and German-to-English bilingual
retrieval we achieved an effectiveness comparable to that of monolingual
retrieval (11.23% versus 12.11% average precision). However, the same approach
did not perform comparably well on the domain-specific GIRT collection (4.20%
versus 15.78%).

Considering the Leo approach, which used a word-by-word translation of the
original topics, one can say that this approach is too simplistic. Without any
means for word disambiguation, a reasonable effectiveness could not be reached.
During the translation process the size of the topics grew by 92% on average
(on average the original topics contain 20.12 terms, while the topics translated
by Leo consist of 38.73 terms).
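
The growth figure follows directly from the two average topic lengths quoted above; a quick check (the variable names are ours):

```python
original_terms = 20.12    # average topic length before translation
translated_terms = 38.73  # average topic length after translation by Leo

# Relative growth of the topics caused by keeping all dictionary translations.
growth = (translated_terms - original_terms) / original_terms
expansion_factor = translated_terms / original_terms

print(f"growth: {growth:.1%}")
print(f"expansion factor: {expansion_factor:.2f}")
```

In other words, each original term was replaced by a little under two target-language terms on average.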

5 Conclusion 

We have used HyREX for bilingual information retrieval. While the overall
performance of the system in terms of effectiveness is rather low, we have shown
that the system’s design and its flexibility allow us to extend it by cross-lingual
IR methods. The architecture of HyREX, which provides data types with vague
predicates as a query interface on the conceptual level, forms the basis for these
extensions.

Our next steps will aim at improving the retrieval effectiveness. More en- 
hanced methods for bilingual IR need to be implemented, especially for retrieval 
in domain-specific collections. Furthermore we would like to further extend the 
system in order to be able to also participate in multi-lingual retrieval tasks. 
Abstract. We investigated dictionary based cross language information
retrieval using lexical triangulation. Lexical triangulation combines the results
of different transitive translations. Transitive translation uses a pivot language
to translate between two languages when no direct translation resource is
available. We took German queries and translated them via Spanish or Dutch
into English. We compared the results of retrieval experiments using these
queries with other versions created by combining the transitive translations or
created by direct translation. Direct dictionary translation of a query introduces
considerable ambiguity that damages retrieval: average precision was 79% below
monolingual in this research. Transitive translation introduces more ambiguity,
giving results worse than 88% below direct translation. We have shown that
lexical triangulation between two transitive translations can eliminate much of
the additional ambiguity introduced by transitive translation.



1 Introduction and Background 

Cross Language Information Retrieval (CLIR) addresses the situation where the query 
that a user presents to an IR system, is not in the same language as the corpus of 
documents they wish to search. This situation presents a number of challenges 
(Grefenstette (1998)) but primary amongst these is the problem of crossing the 
language barrier (Schauble & Sheridan (1997)). Almost all the approaches to this 
problem require access to some form of rich translation resource to map terms in the 
query language (the source) to terms in the corpus (the target). “Transitive” CLIR 
aims to address the situation where there are limited direct translation resources 
available (Ballesteros (2000)). 

A transitive CLIR system translates the source language terms by first translating 
the terms into an intermediate or "pivot" language and then translating the resulting 
terms into the target language. Thus, a transitive system could translate a query from 
German to English via either Dutch, or Spanish. 

The main aim of this work is to combine translations from two different transitive 
routes to discover if this can reduce the ambiguity introduced by transitive translation. 
Ballesteros suggested the possibility of using this approach in the summary to her 
recent chapter (Ballesteros (2000)). We have chosen to call this approach “lexical 
triangulation”, see Figure 1. 



C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 245-252, 2001.
© Springer-Verlag Berlin Heidelberg 2001

Fig. 1. Lexical triangulation

We have chosen to simulate a Machine-Readable Dictionary (MRD) approach to 
CLIR. This follows on from the work of Ballesteros & Croft (1996, 1997, 1998), and 
Ballesteros (2000). 



2 The Experimental Environment 

The underlying IR system used in the Sheffield submission was the GLASS system 
(Sanderson (2000)). 

The translation resources were derived from the German, Spanish, Dutch, and 
English components of EuroWordNet (Vossen (1999)). The data used to lemmatise 
the German queries was derived from the CELEX German databases. 



2.1 EuroWordNet 

Given that the intention of this work is to examine CLIR using simulated Machine
Readable Dictionaries, the choice of EuroWordNet (Vossen (1999)) as the primary
translation resource may appear a little strange. The primary basis for this choice was
availability (footnote 1).

The intention of the EuroWordNet project was to develop a database of WordNets 
for a number of European languages similar to, and linked with, the Princeton 
WordNet 1.5 (Vossen (1997)). This effectively makes English the inter lingua that all 
the other languages link through. One of the intended uses of EuroWordNet was in 
multi-lingual information retrieval (Vossen (1997)). Gonzalo, et al. (1998) describes 
a possible implementation. 

By developing a series of WordNets for European languages, and linking them to
the original Princeton 1.5 WordNet for English, EuroWordNet has created a structure
similar to the controlled vocabulary thesaurus used by Salton as described by Oard &
Dorr (1996). The structure is also very similar to the structure developed by
Diekema, et al. (1998). The Princeton WordNet consists of synonyms grouped
together to form “synsets”; basic semantic relationships link these together to form the
WordNet (Vossen (1997), Miller, et al. (2000)). Each synset has a unique identifier
(synset-id).

1 The Sheffield University Computer Science Department was a collaborator in the
EuroWordNet project and Wim Peters of that department kindly made extracts from
EuroWordNet available for this research.

In EuroWordNet, the relationships between the synsets of the various component
languages and the Princeton 1.5 WordNet synsets (footnote 2) can take many forms.
These include, for example, the eq_hyponym (footnote 3) relation, which relates
more general to more specific concepts (Vossen (1997)).

Our work used EuroWordNet to generate structures to simulate a Machine 
Readable Dictionary. The only relationships used in the construction of the dictionary 
tables, were the eq_synonym and eq_near_synonym relationships. These are by far 
the most restrictive and precise of the possible relationships. 

The eq_synonym relationship records the fact that the language synset is 
synonymous with the WordNet synset. EuroWordNet introduced the 
eq_near_synonym relationship to record the fact that certain terms that share a 
common hypernym (more general concept) are closer in meaning than others. In this 
situation the co-hyponyms (more specific terms) that are closely related are close 
enough in meaning that they could be used for translation purposes, but are not 
synonymous and are therefore not in the same synset. This closeness is represented 
by linking the synsets with an eq_near_synonym relationship (Vossen (1997)). 

For each language used from EuroWordNet, two tables were generated. The first
mapped lemmas to the synset-ids of the synsets related by eq_synonym or
eq_near_synonym. The second mapped synset-ids to their constituent lemmas (i.e.
those related by eq_synonym or eq_near_synonym). As we will explain below, these
tables are used to parameterise the translation process.
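
The two tables per language can be pictured as a pair of dictionaries built from relation records. The sketch below assumes a simple (lemma, relation, synset-id) record layout, which is our own simplification and not the EuroWordNet file format; the lemmas and identifiers are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (lemma, relation, synset_id) records; an assumed layout for illustration only.
Record = Tuple[str, str, str]
USED_RELATIONS = {"eq_synonym", "eq_near_synonym"}


def build_tables(records: Iterable[Record]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (lemma -> synset-ids, synset-id -> lemmas) for one language."""
    lemma_to_synsets: Dict[str, Set[str]] = defaultdict(set)
    synset_to_lemmas: Dict[str, Set[str]] = defaultdict(set)
    for lemma, relation, synset_id in records:
        if relation not in USED_RELATIONS:
            continue  # all other relation types are ignored
        lemma_to_synsets[lemma].add(synset_id)
        synset_to_lemmas[synset_id].add(lemma)
    return dict(lemma_to_synsets), dict(synset_to_lemmas)


german_records = [
    ("fluss", "eq_synonym", "s-river"),
    ("strom", "eq_near_synonym", "s-river"),
    ("strom", "eq_synonym", "s-current"),
    ("gewaesser", "eq_hyponym", "s-river"),  # dropped: relation not used
]
lemma_to_synsets, synset_to_lemmas = build_tables(german_records)
```

Both mappings are many-to-many, which is where the translation ambiguity discussed later comes from.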



2.2 The Translation and Processing of Queries

Query processing was fully automatic and the queries were generated using all parts 
of the topics. The queries were passed through a series of processes as follows: 

• Parsing - The conversion of the topics to queries which makes use of title, 
description and narrative fields. 

• Normalisation - All characters were reduced to their lower case unaccented
equivalents (i.e. “Ö” reduced to “o” and “É” to “e” etc.) in order to maximise
matching in both the lemmatisation and translation processes.

• Lemmatisation - The various inflected forms of the query words were reduced
to a canonical lemma form to enable matching with the German EuroWordNet
translation resources. A table derived from the CELEX German database was
used to determine the appropriate lemmata (footnote 4) for a word form. German
compound words were split using a simple algorithm. The algorithm looks for
a series of word forms that will match with the whole compound. If such a
complete match is found, the corresponding lemmata of the word forms are
returned. The algorithm takes account of the use of “s” as “glue” in the
construction of German compounds. This approach was based on the
description of the word reduction module in Sheridan & Ballerini (1996). All
of the CELEX data was normalised to unaccented lower case for matching
with the query words.

2 In EuroWordNet terms, the Inter-Lingual Index or ILI.

3 The relationships in EuroWordNet have names of the form eq_relationship_name; the
eq_ indicates that the relationship involves some degree of “equality”.

4 The wordform to lemma table is a many-to-many mapping as a wordform may be a valid
inflection of more than one lemma.

• German Stop Word Removal - A stopword list, generated from the CELEX
German database, was used to remove words in the query that carried little
meaning and would otherwise introduce noise to the translation. The stopword
list contains all of the German words marked as articles, pronouns,
prepositions, conjunctions or interjections in the CELEX database.

• Translation - The translation process used tables derived from EuroWordNet 
to translate between two languages. The lemma to synset-id table for the first 
language and the synset to lemma table for the second language were used to 
map words in the first language to words in the second. All the possible 
translations through the intermediate synset-ids were returned. Three different 
translations were created for each query: a direct German to English 
translation, a transitive translation using Spanish as the intermediate language, 
and a transitive translation using Dutch as the intermediate language. 

• Merging - The results of the two transitive translation routes were merged to 
produce a fourth translation, the triangulated translation. The merge process 
was conducted on an “original German Lemma” by “original German 
Lemma” basis. The translations from each route for each lemma were 
compared and only translations common to both routes were used to translate 
the lemma. 

• Retrieval - The translation and merging process produced four different
versions of the queries translated into English. These were submitted to the
GLASS IR system, which had been used to index the English corpus. The
GLASS system normalised both documents and queries to lower case, and
removed any English stopwords (using a standard English stop word list).
Porter stemming (Porter (1980)) was used on both the queries and the
collection. No special processing was used on the corpus.
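
The compound-splitting step in the lemmatisation stage can be sketched as a recursive search for word forms that cover the whole compound, optionally skipping a glue “s” between components. This is our own minimal reading of the description, not the authors' code: the table contents are invented, each word form is given a single lemma for brevity (the real table is many-to-many), and preferring long first components is an assumption.

```python
from typing import Dict, List, Optional

# Toy word-form -> lemma table standing in for the CELEX-derived data
# (already normalised to unaccented lower case).
FORMS: Dict[str, str] = {
    "arbeit": "arbeit",
    "markt": "markt",
    "markte": "markt",
    "wind": "wind",
    "energie": "energie",
}


def split_compound(word: str, forms: Dict[str, str]) -> Optional[List[str]]:
    """Return the lemmata of word forms that together cover the whole word, or None."""
    if not word:
        return []
    for end in range(len(word), 0, -1):  # assumption: prefer long first components
        head = word[:end]
        if head not in forms:
            continue
        rest = word[end:]
        candidates = [rest]
        if rest.startswith("s") and len(rest) > 1:
            candidates.append(rest[1:])  # treat "s" as glue between components
        for tail in candidates:
            tail_lemmata = split_compound(tail, forms)
            if tail_lemmata is not None:
                return [forms[head]] + tail_lemmata
    return None  # no complete match: the word is left as it is by the caller
```

For instance, with the toy table above, "arbeitsmarkt" is covered by "arbeit" and "markt" once the glue “s” is skipped.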



3 The Experimental Story 

We submitted four official runs to the CLEF evaluation process. 

• A “bilingual” run (shefbi), generated from the direct translation from German
to English.

• A “Spanish transitive” run (shefes), generated from the transitive translation 
using Spanish as the intermediate. 

• A “Dutch transitive” run (shefnl), generated from the transitive translation 
using Dutch as the intermediate. 

• And a “triangulated” run (sheftri), generated from the result of merging the
two transitive translations.

• Only the triangulated run (sheftri) was judged and contributed to the relevance
judgement pool.
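
The difference between the direct, transitive and triangulated query versions comes down to how the lookup tables are chained. The sketch below illustrates this on invented data; the table contents, synset identifiers and function names are assumptions, and only the per-lemma intersection reflects the merging rule described in Section 2.2.

```python
from typing import Dict, Set

Table = Dict[str, Set[str]]


def translate(lemmas: Set[str], src_lemma_to_syn: Table, tgt_syn_to_lemma: Table) -> Set[str]:
    """Map lemmas to the target language through shared synset-ids, keeping all results."""
    out: Set[str] = set()
    for lemma in lemmas:
        for synset_id in src_lemma_to_syn.get(lemma, set()):
            out |= tgt_syn_to_lemma.get(synset_id, set())
    return out


def transitive(lemma: str, src_l2s: Table, pivot_s2l: Table,
               pivot_l2s: Table, tgt_s2l: Table) -> Set[str]:
    """Source -> pivot -> target, keeping every translation at each step."""
    pivot_lemmas = translate({lemma}, src_l2s, pivot_s2l)
    return translate(pivot_lemmas, pivot_l2s, tgt_s2l)


def triangulate(route_a: Set[str], route_b: Set[str]) -> Set[str]:
    """Keep only the translations of one source lemma that both routes agree on."""
    return route_a & route_b


# Invented toy tables: each pivot word is ambiguous in a different way.
DE_L2S = {"fluss": {"s-river"}}
ES_S2L = {"s-river": {"rio"}}
ES_L2S = {"rio": {"s-river", "s-laugh"}}
NL_S2L = {"s-river": {"stroom"}}
NL_L2S = {"stroom": {"s-river", "s-current"}}
EN_S2L = {"s-river": {"river"}, "s-laugh": {"laugh"}, "s-current": {"current"}}

direct = translate({"fluss"}, DE_L2S, EN_S2L)
via_spanish = transitive("fluss", DE_L2S, ES_S2L, ES_L2S, EN_S2L)
via_dutch = transitive("fluss", DE_L2S, NL_S2L, NL_L2S, EN_S2L)
triangulated = triangulate(via_spanish, via_dutch)
```

In this toy case each pivot adds one spurious English term, and the intersection removes both because the two pivots are ambiguous in different ways.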

In order to provide a baseline for comparison we conducted an additional English 
monolingual run using the same parsing and retrieval processes. This unofficial run is 
presented below to enable comparisons to be made. 

In summary, the experimental conditions were as follows: 



Experimental Variable          Value for this experiment
-----------------------------  ------------------------------------------------
Queries                        CLEF 2000 CLIR, German and English
Corpus                         LA Times 1994 - CLEF Collection
Relevance Judgements           CLEF 2000 pool
Corpus and Query Stemming      Yes, Porter based
Lemmatiser                     Yes, including German Compound Splitting
German Stop-words removed      Yes, all articles, pronouns, prepositions,
pre-translation                conjunctions or interjections from the CELEX
                               German database.
Translation                    Simulated Dictionary based, using lookup-tables
                               derived from EuroWordNet eq_synonym and
                               eq_near_synonym relations.
Merging Strategy for           Only translations common to both transitive
Lexical triangulation          routes.



3.1 Results 

The table below shows the average precision for the five runs that made up the CLEF
experiment. Only the cross language runs were submitted to CLEF, and of those,
only the triangulated run contributed to the pooled results.



Run                            Average precision (Porter, Intersection)
-----------------------------  ----------------------------------------
English                        0.3593
Bilingual (shefbi)             0.0856
Triangulated (sheftri)         0.0458
Spanish Transitive (shefes)    0.0098
Dutch Transitive (shefnl)      0.007



The standard 11-point recall and precision curves for the five runs are shown
below; the second graph shows only the four cross language runs.



3.2 Analysis 

Comparing the average precision of the monolingual run with the bilingual run we see
that the bilingual run is some 76% (footnote 5) below the monolingual. This compares to the
60% below worst case reported by Ballesteros & Croft (1996) when considering word
by word dictionary based Spanish to English CLIR.

5 Statistically significant at the 0.01 level under both the sign and Wilcoxon tests.

[Figure: 11-point recall-precision curves for the English, Bilingual, Triangulated,
Spanish Transitive and Dutch Transitive runs; recall axis from 0 to 1]

[Figure: the same curves for the four cross language runs only (Bilingual,
Triangulated, Spanish Transitive, Dutch Transitive); recall axis from 0 to 1]

Taking next the two transitive runs, we observe a differential of -88% in the case 
of the Spanish transitive run and -92% in the case of the Dutch transitive run relative 
to the bilingual run. Both of these results are statistically significant at the 0.01 level 
under both the sign and Wilcoxon tests. These figures are in line with the -92% 
differentials reported by Ballesteros (2000) for transitive retrieval of Spanish - French 
CLIR with English as the pivot compared to Spanish - French direct translation. 

Comparing the triangulated run with the two transitive runs reveals the expected 
improvement in performance. The differentials for the two transitive runs relative to 
the triangulated run are -79% for the Spanish transitive run and -85% for the Dutch 
transitive. Both of these figures are statistically significant at the 0.01 level under 
both the sign and Wilcoxon tests. 
There is also a statistically significant differential of -47% between the triangulated
run and the bilingual in favour of the bilingual. This significance is at the 0.01 level
under both the sign and Wilcoxon tests.
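
The differentials quoted in this section can be recomputed from the average precision values in Section 3.1 as relative changes against the respective baseline. The short script below does this; the function name is ours, and because the published precision values are themselves rounded, the recomputed figures agree with the reported ones only to within about one percentage point.

```python
# Average precision values as reported in Section 3.1.
AVG_PRECISION = {
    "English": 0.3593,
    "Bilingual": 0.0856,
    "Triangulated": 0.0458,
    "Spanish Transitive": 0.0098,
    "Dutch Transitive": 0.007,
}


def differential(run: str, baseline: str) -> float:
    """Relative change in average precision of `run` against `baseline`, in percent."""
    base = AVG_PRECISION[baseline]
    return 100.0 * (AVG_PRECISION[run] - base) / base


COMPARISONS = [
    ("Bilingual", "English"),
    ("Spanish Transitive", "Bilingual"),
    ("Dutch Transitive", "Bilingual"),
    ("Spanish Transitive", "Triangulated"),
    ("Dutch Transitive", "Triangulated"),
    ("Triangulated", "Bilingual"),
]

for run, baseline in COMPARISONS:
    print(f"{run} vs {baseline}: {differential(run, baseline):.1f}%")
```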



4 Conclusion 

In summary, these results support the results of Ballesteros (2000) with respect to the
behaviour of transitive translation in CLIR. They also support the hypothesis we set
out to test, namely that lexical triangulation improves the results
from transitive translation in dictionary based CLIR.

This work made use of relatively rich resources in the form of EuroWordNet. 
However, it remains to be seen if these results could be repeated using the poorer 
quality resources that are likely to be available for translating between less common 
pairs of languages. 

As Samuel Johnson said “Dictionaries are like watches; the worst is better than 
none, and the best cannot be expected to be quite true.” (Gendreyzig (2000)) 
Abstract. West Group participated in the non-English monolingual retrieval
task for French and German. Our primary interest was to investigate
whether retrieval of German or French documents was any different
from the retrieval of English documents. We focused on two aspects:
stemming for both languages and compound breaking for German. In
particular, we studied several query formulations to take advantage of
German compounds. Our results suggest that German retrieval is indeed
different from English or French retrieval, inasmuch as accounting
for compounds can significantly improve performance.



1 Introduction 

West Group’s first attempt at non-English monolingual retrieval was through its
participation in the Amaryllis-2 campaign. Our findings during that campaign were
that there was little difference between French and English retrieval, once the 
inflectional nature of French was handled through stemming or morphological 
analysis. For CLEF-2000, our goal for French document retrieval was to investi- 
gate the impact of our stemming methods. We compare performing no stemming, 
stemming using an inflectional morphological analyzer, and stemming using a 
rule-based algorithm similar to Porter’s English stemmer. 

Our main focus, however, was German document retrieval. German intro- 
duced a new dimension to our previous work: compound terms. We set up our 
experiments to assess whether we could ignore compound terms, i.e., handle Ger- 
man retrieval like we handled French or English retrieval, or whether we could 
leverage the existence and decomposition of compounds. 

For both our French and German experiments, we relied on a slightly
altered version of the WIN engine, West Group’s implementation of the inference
network retrieval model [Tur90]. We used third-party stemmers to handle
non-English languages.

In the following, we briefly describe the WIN engine and its adaptation to
non-English languages. We report our variants for German document retrieval in
Section 3. Section 4 describes experiments with stemming for French monolingual
retrieval.

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 253-|2nni 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 






Isabelle Moulinier, J. Andrew McCulloh, and Elizabeth Lund 



2 General System Description 

The WIN system is a full-text natural language search engine, and corresponds to
West Group’s implementation of the inference network retrieval model. While
based on the same retrieval model as the INQUERY system [CCH92], WIN
has evolved separately and focuses on the retrieval of legal material in large
collections in a commercial environment that supports both Boolean and natural
language searches [Tur94].

The WIN engine supports three types of document scoring: the document as 
a whole is scored; each paragraph is scored and the document score becomes the 
best paragraph score; the score of the whole document and the best paragraph 
score are combined. We used the following scoring approaches for our CLEF 
experiments: 

— German retrieval considered that a document was scored as a whole docu- 
ment; 

— French retrieval used an average of the whole document score and the best 
paragraph score. 

This choice was prompted by the amount of information available in the various
collections. For instance, the French document collection provided more
“paragraph” marked-up information (footnote 1) than the German document
collections did.
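
The three scoring types can be summarised in a few lines. This is a schematic sketch and not WIN code: the function name and mode labels are ours, and the combined mode uses the plain average reported for the French runs, whereas WIN may support other combinations.

```python
from typing import Sequence


def document_score(whole: float, paragraphs: Sequence[float], mode: str) -> float:
    """Score a document from its whole-document score and its paragraph scores."""
    best_paragraph = max(paragraphs, default=0.0)
    if mode == "whole":
        return whole                      # used for the German runs
    if mode == "best_paragraph":
        return best_paragraph
    if mode == "combined":
        # The French runs averaged the two scores.
        return (whole + best_paragraph) / 2.0
    raise ValueError(f"unknown scoring mode: {mode}")
```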

We indexed non-English collections using a slightly modified version of WIN 
for each language: 

— Indexing German documents used a third-party stemmer based on a
morphological analyzer. One feature was compound decomposition: forcing
decomposition or not was a parameter in our experiments. Additionally, we
indexed both German collections provided by CLEF as one single retrieval
collection and did not investigate merging retrieved sets.

— Indexing French documents required adding a tokenization rule to handle
elision, and investigated two kinds of stemmers: a third-party stemmer based
on a morphological analyzer, and a rule-based stemmer (à la Porter) from
the Muscat project.

A WIN query consists of concepts extracted from natural language text. Normal
WIN query processing eliminates stopwords and noise phrases (or introductory
phrases), and recognizes phrases or other important concepts for special
handling. Many of the concepts ordinarily recognized by WIN are specific to both
English documents and the legal domain. To perform these tasks, WIN relies
on various resources: a stopword list, a list of introductory phrases (“Find cases
about...”, “A relevant document describes...”), and a dictionary of (legal) phrases.

Query processing for French was similar to English query processing. We
used a stopword list of 1745 terms (highly frequent terms, and noise terms like
adverbs). For noise phrases, we used the TREC-6, 7 and 8 topics and refined the
list of introductory patterns we created for Amaryllis-2. In the end, there were
160 patterns (a pattern is a regular expression that handles case variants and
some spelling errors). We did not use phrase identification for lack of a general
French phrase dictionary.

1 We considered the element TEXT as a paragraph delimiter.

For German, we investigated several options for structuring the queries,
depending on whether compounds were decomposed or not. This specific
processing is described in Section 3. We used a stopword list of 333 terms. Using the
TREC-6, 7 and 8 topics, we derived a set of introductory patterns for German.
There were 11 regular expressions, summarizing over 200 noise phrases. We did
not identify phrases using a phrase dictionary. However, in some experiments,
German compounds have been treated as “natural phrases”.

Finally, we extracted concepts from the full topics. However, we gave more
weight to concepts appearing in the Title or Description fields than to concepts
extracted from the Narrative field. Following West’s participation at TREC-3
[TTYF95], we assigned a weight of 4 to concepts extracted from the Title field,
while concepts originating from the Description and Narrative fields were given
a weight of 2 and 1, respectively.
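
The field weighting can be illustrated with a small helper. The weights 4, 2 and 1 come from the text; how a concept occurring in several fields is handled is not stated, so taking the highest applicable weight is an assumption of this sketch, as are the function name and the topic layout.

```python
from typing import Dict, List

FIELD_WEIGHTS = {"title": 4, "description": 2, "narrative": 1}


def weight_concepts(topic: Dict[str, List[str]]) -> Dict[str, int]:
    """Give each concept the weight of the highest-weighted field it appears in (assumed rule)."""
    weights: Dict[str, int] = {}
    for field, weight in FIELD_WEIGHTS.items():
        for concept in topic.get(field, []):
            weights[concept] = max(weight, weights.get(concept, 0))
    return weights


topic = {
    "title": ["windenergie"],
    "description": ["windenergie", "nutzung"],
    "narrative": ["nutzung", "kraftwerk"],
}
weights = weight_concepts(topic)
```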



3 German Monolingual Retrieval Experiments and 
Results 



Our experiments with monolingual German retrieval focused on query processing 
and compound decomposition. Our submitted runs rely on decomposing com- 
pounds, but we also experimented with no decomposition, and no stemming at 
all. Indexing followed the choice made for query processing. For instance, when 
no decomposition was performed for query terms, parts of compounds were not 
indexed, only the compound term was. 

When we decided to break compound terms, we faced the choice of considering
a compound term as a single concept in our WIN query, or treating the
compound as several concepts (as many concepts as there were parts in the
compound). The submitted run WESTgg1 considers that a compound corresponds
to several concepts; the run WESTgg2 handles a compound as a single concept.

Given the compound Windenergie, the structured query in WESTgg1 introduces
2 concepts, Wind and Energie; the structured query in WESTgg2 introduces
1 concept, #PHRASE(Wind Energie). The #PHRASE operator is a soft
phrase, i.e. the component terms must appear within 3 words of one another. The
score of the #PHRASE concept in our experiment was set to be the maximum
score of either the soft phrase itself or its components.
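
The two query structures and the soft-phrase scoring rule can be sketched as follows. This is an illustration under assumptions rather than WIN's implementation: “within 3 words” is read as a position difference of at most 3, the term and phrase scores are placeholder numbers, and all function names are ours.

```python
from itertools import product
from typing import Dict, List, Sequence


def structure_query(parts: Sequence[str], as_phrase: bool) -> list:
    """WESTgg1-style: one concept per part; WESTgg2-style: a single phrase concept."""
    return [tuple(parts)] if as_phrase else list(parts)


def soft_phrase_match(tokens: List[str], parts: Sequence[str], window: int = 3) -> bool:
    """True if every part occurs and some occurrences lie within `window` positions of each other."""
    positions = [[i for i, t in enumerate(tokens) if t == p] for p in parts]
    if any(not p for p in positions):
        return False
    return any(max(combo) - min(combo) <= window for combo in product(*positions))


def phrase_concept_score(tokens: List[str], parts: Sequence[str],
                         term_score: Dict[str, float], phrase_score: float = 1.0) -> float:
    """Maximum of the soft-phrase score and the scores of the components that occur."""
    scores = [term_score.get(p, 0.0) for p in parts if p in tokens]
    if soft_phrase_match(tokens, parts):
        scores.append(phrase_score)
    return max(scores, default=0.0)
```

With the maximum rule, a document that contains only one part of the compound still receives that part's score, so the phrase concept never scores lower than its best component.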

Table 1 summarizes the results of our two official runs, as well as the results
of the runs NoStem, where no stemming was used, and NoBreak, where stemming
but no decomposition was used.

The results reported in Table 1 support the hypothesis that German document
retrieval differs from English document retrieval. Decomposing compound
words, regardless of the query structure, significantly improves the performance
of our German retrieval system. Stemming on its own, however, only marginally
improves retrieval performance, as can be observed in Figure 1 for runs NoBreak
and NoStem.

Table 1. Summary of individual run performance on the 37 German topics with
relevant documents

                                 Performance of individual queries
Run       Avg. Prec.  R-Prec.    Best  Above  Median  Below  Worst
--------  ----------  -------    ----  -----  ------  -----  -----
WESTgg1   0.3840      0.3706     3     21     3       9      1
WESTgg2   0.3779      0.3628     3     18     6       9      1
NoBreak   0.2989      0.3141     0     18     1       15     3
NoStem    0.2986      0.3080     0     15     1       19     2

We expected a greater difference between our two submitted runs. WESTgg1
allows compound terms to contribute more to the score of a document, while
WESTgg2 gives the same contribution to compound and non-compound terms.
The contribution of a compound term in WESTgg1 is weighted by the number
of parts in the compound, so one would expect its occurrence in a document to
significantly alter a document score.

After reviewing individual queries, we noticed the following behavior. First,
for those queries where both the compounds and their parts had an average
document frequency (more precisely idf), i.e., were neither particularly
common nor particularly rare, WESTgg1 and WESTgg2 behaved similarly. In that
case, parts helped locate documents, but did not add to or draw away from the
document relevance score. Second, for those queries where the compound itself
was a rather common term in the collection, but where the individual parts
were average, the weighted contribution of the parts provided in WESTgg1
performed better. This case reflects the frequent use of a compound term as a
single entity, with limited use of its constituent parts in the collection. Third,
for those queries where at least one part of a compound was very common, the
high occurrence of that part degraded the weighting scheme of WESTgg1, thus
the single concept construct of WESTgg2 provided a more representative score.

Finally, compound handling in WESTgg1 as well as WESTgg2 can only be as
influential as the number of compounds in the query allows. In the 40 German
topics, roughly 16% of the query terms are compound terms. The difference between
runs WESTgg2 (decomposing using soft phrases) and NoBreak can hardly be
explained by these 16% or the soft phrase construct. The difference is the result
of the indexing process. Because run WESTgg2 used compound breaking during
indexing (while NoBreak did not), we were able to match a non-compound query
term with part of a compound in indexed documents.

Fig. 1. Recall/precision curves for the 4 runs: WESTgg1 and WESTgg2 use
compound decomposition, NoBreak uses stemming but no decomposition, and
NoStem uses raw forms.

4 French Monolingual Retrieval Experiments and Results

The goal of our experiments with French document retrieval was to assess the
difference between stemming algorithms. Our motivation was to further
investigate the particularity of French compared to English. From the various kinds
of stemming approaches used for English document retrieval, we have
studied two types of stemmers as well as no stemming at all:

— a rule-based stemmer “a la Porter” that approximates mainly inflectional 
rules, but also provides a limited set of derivational rules based on suffix 
stripping, e.g. it strips suffixes like -able or -isme; 

— a stemmer based on an inflectional morphological analyzer, e.g., it conflates 
verb forms to the infinitive of the verb, noun forms to the singular noun, 
adjectives to the masculine singular form. This stemmer is based on a lexicon. 
As this stemmer does not resolve morphological ambiguities, several stems 
may correspond to a term. For instance, porte may stem to porte (noun)
and porter (verb).

In the runs using the inflectional stemmer, we investigated various ways of 
handling the multiple stems generated for a single term. WESTff, our submitted 
run, relied on selecting a single stem per term (the first stem in lexical order). 
The run labelled MultiStem kept the multiple stems, and used the structured 
query to group those multiple stems into a single concept. The results reported 
here grouped multiple stems under a #syn operator.^ We also ran experiments
using a Porter-like stemmer and no stemming at all.

^ We also tried grouping multiple stems under a #sum operator. We found no
significant difference between the two approaches.
Table 2. Summary of individual run performance on the 34 French topics with
relevant documents

                                    Performance of individual queries
Run         Avg. Prec.   R-Prec.    Best   Above   Median   Below   Worst
WESTff      0.4903       0.4371     11     9       7        7       0
MultiStem   0.4964       0.4352     7 ^    16      1        10      0
Porter      0.4680       0.4297     6 ^    14      1        13      0
NoStem      0.4526       0.4210     7 ^    8       0        19      0

Table 2 summarizes experimental results while Figure 2 presents
recall/precision curves for our French runs.

The slight difference between our submitted run WESTff and the run
MultiStem can be explained by WESTff's arbitrary decision of picking the first stem:
for instance, opinion and opinions do not stem to the same form; the former
stems to opinion, and the latter to opiner.
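The effect of the two strategies can be sketched in a few lines of Python. The lexicon below is a hypothetical toy stand-in for the inflectional analyzer, and representing a grouped concept as a set of stems is an assumption made for illustration, not the WIN query syntax.

```python
# Hypothetical output of an inflectional analyzer that does not
# resolve morphological ambiguity: each term maps to all of its stems.
TOY_LEXICON = {
    "opinion": ["opinion"],
    "opinions": ["opinion", "opiner"],  # plural noun or a form of the verb opiner
    "porte": ["porte", "porter"],
}


def first_stem(term):
    """Single-stem strategy: keep the first stem in lexical order."""
    return sorted(TOY_LEXICON[term])[0]


def stem_group(term):
    """Multi-stem strategy: keep every stem, grouped into one concept."""
    return frozenset(TOY_LEXICON[term])


def concepts_match(term_a, term_b):
    """Two terms match under the grouped strategy if they share a stem."""
    return bool(stem_group(term_a) & stem_group(term_b))
```

With this toy lexicon, `first_stem("opinions")` is `opiner` while `first_stem("opinion")` is `opinion`, so the single-stem strategy fails to conflate the two forms, whereas `concepts_match("opinion", "opinions")` holds.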

While we usually consider not stemming as a baseline, our tests showed that
the baseline performed better on several topics. In those instances, we found that
the Porter stemmer was too aggressive and stemmed important query terms to
very common forms. For instance, parti was stemmed to part, directive to
direct and français to franc. The inflectional stemmer did exactly what it was
supposed to do, e.g. stem française to français. However, certain stems were
very common, while their raw form was less common. For a couple of queries,
the Porter stemmer performed better than the inflectional stemmer. This reflects
one limitation of the inflectional stemmer: it is only as good as its lexicon. For
instance, the Porter stemmer performed better on topic 23 because the important
query terms ménopausiques and ménopausée were stemmed to the same form,
menopaus, while the inflectional stemmer failed to stem ménopausiques because
it was not in its lexicon.

Finally, we ran a manual run to determine whether phrase identification, e.g.,
académie française and monnaie européenne, was likely to improve
performance, just as it has proven to be beneficial for the English version of the WIN
search engine. We observed a slight improvement (average precision: 0.4994;
R-precision: 0.4427) over not using phrases.

While our analysis is only partial at this time, our French stemming results
follow the patterns exhibited by [Hul96] for English stemming, except that
inflectional stemming appears slightly superior. We do not know yet whether this
is a particularity of the French language or of this particular collection and set
of topics.



^ For some queries, these runs achieved an average precision that was higher than the 
best average precision reported at CLEF-2000. 
Fig. 2. Recall/precision curves for our 4 runs: WESTff and MultiStem use the
inflectional stemmer, Porter uses a Porter-like stemmer and NoStem uses raw
forms.



5 Summary 

The WIN retrieval system achieved good performance for both German and
French document retrieval without any major modification being made to its
retrieval engine. On the one hand, we showed that German document retrieval
required special handling because of the use of compound words in the language.
Our results showed that decomposing compounds during indexing and query
processing enhanced the capabilities of our system. Our French experiments, on the
other hand, did not uncover any striking difference between French and English
retrieval, except a performance improvement due to the use of an inflectional
stemmer.
Abstract. This paper presents work on document retrieval for Italian
carried out at ITC-irst. Two different approaches to information retrieval 
were investigated, one based on the Okapi weighting formula and one 
based on a statistical model. Development experiments were carried out 
using the Italian sample of the TREC-8 CLIR track. Performance evalu- 
ation was done on the Cross Language Evaluation Forum (CLEF) 2000 
Italian monolingual track. The two methods achieved mean average pre- 
cisions of 49.0% and 47.5%, respectively, which were the two best scores 
of their track. 



1 Introduction 

This paper reports on Italian text retrieval research that has recently started at 
ITC-irst. Experimental evaluation was carried out in the framework of the Cross 
Language Evaluation Forum (CLEF), a text retrieval system evaluation activity 
coordinated in Europe from 2000, in collaboration with the US National Insti- 
tute of Standards and Technology (NIST) and the Text REtrieval Conference 
(TREC). 

ITC-irst has begun to develop monolingual text retrieval systems for the
main purpose of accessing broadcast news archives. This paper presents two
Italian monolingual text retrieval systems that have been submitted to CLEF
2000: a conventional Okapi derived model, and a statistical retrieval model.
After the evaluation, a combined model was also developed that simply integrates
the scores of the two basic models. This simple and effective model shows a
significant improvement over the two single models.

The paper is organized as follows. In Section 2, the text preprocessing of 
documents and queries is presented. Sections 3 and 4 introduce the text retrieval
models that were officially evaluated at CLEF and present experimental results. 
Section 5 discusses improvements on the basic models that were made after the 
CLEF evaluation. In particular, a combined retrieval model is introduced and 
evaluated on the CLEF test collection. Finally, Section 6 offers some conclusions 
regarding the research at ITC-irst in the field of text retrieval. 

C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 261 -^^ 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 






Nicola Bertoldi and Marcello Federico 



2 Text Preprocessing 

Document and query preprocessing involves several stages: tokenization,
morphological analysis of words, part-of-speech (POS) tagging of text, base form
extraction, stemming, and stop-term removal.



Tokenization. Tokenization of text is performed in order to isolate words from 
punctuation marks, recognize abbreviations and acronyms, correct possible word 
splits across lines, and discriminate between accents and quotation marks. 



Morphological Analysis. A morphological analyzer P] decomposes each Ital- 
ian inflected word into its morphemes, and suggests all possible POSs and base 
forms of each valid decomposition. By base forms we mean the usual uninflected
entries of a dictionary.



POS Tagging. POS tagging is based on a Viterbi decoder that computes the 
best text-POS alignment on the basis of a bigram POS language model and a 
discrete observation model jS|. The employed tagger works with 57 tag classes 
and has an accuracy around 96%. 



Base Form Extraction. Once the POS and the morphological analysis of each
word in the text are computed, a base form can be assigned to each word.



Stemming. Word stemming is applied at the level of tagged base forms. POS 
specific rules were developed that remove suffixes from verbs, nouns, and adjec- 
tives. 



Stop-Terms Removal. Words in the collection that are considered non-relevant
for the purpose of information retrieval are discarded in order to save index
space. Words are filtered out on the basis either of their POS or their inverted 
document frequency. In particular, punctuation is eliminated together with arti- 
cles, determiners, quantifiers, auxiliary verbs, prepositions, conjunctions, inter- 
jections, and pronouns. Among the remaining terms, those with a low inverted 
document frequency, i.e. that occur in many different documents, are eliminated. 



Table 1 collects statistics about the effects of text preprocessing steps on
the mean document length ($\bar{l}$), global vocabulary size ($V$), and mean
document vocabulary size ($\bar{V}_d$).

An example of text preprocessing is presented in the appendix at the end of 
this paper. 
Table 1. Effect of text preprocessing steps on the mean document length
($\bar{l}$), global vocabulary size ($V$), and mean document vocabulary size
($\bar{V}_d$).

Terms        Stop   $\bar{l}$   $V$     $\bar{V}_d$
text         no     225         160K    134
base forms   no     225         126K    129
stems        no     225         101K    126
base forms   yes    103         125K    80
stems        yes    103         100K    77


3 Information Retrieval Models 



3.1 Okapi Model 

Okapi is the name of a retrieval system project that developed a family of
weighting functions in order to evaluate the relevance of a document d versus a 
query q. In this work, the following Okapi weighting function was applied: 

$$s(d) = \sum_{w \in q \cap d} f_q(w)\, c_d(w)\, idf(w) \qquad (1)$$

where:

$$c_d(w) = \frac{f_d(w)(k_1 + 1)}{k_1(1 - b) + k_1 b \frac{f_d}{\bar{l}} + f_d(w)} \qquad (2)$$

scores the relevance of w in d, and the inverted document frequency:

$$idf(w) = \log \frac{N - N_w + 0.5}{N_w + 0.5} \qquad (3)$$

evaluates the relevance of w inside the collection. The model implies two
parameters $k_1$ and $b$ to be empirically estimated over a development sample. An
explanation of the involved terms can be found in P) and other papers referred 
in it. 
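The weighting function (1)-(3) can be written as a short Python sketch. It assumes the usual Okapi form in which the document length $f_d$ is divided by the mean document length $\bar{l}$; the function and argument names are illustrative only, and the default parameter values are the ones reported as chosen in Section 4.2.

```python
import math


def idf(n_docs, n_docs_with_w):
    """Inverted document frequency of formula (3)."""
    return math.log((n_docs - n_docs_with_w + 0.5) / (n_docs_with_w + 0.5))


def c_d(f_dw, doc_len, mean_len, k1=1.5, b=0.4):
    """Within-document term weight of formula (2), with the document
    length normalized by the mean document length (assumed form)."""
    return f_dw * (k1 + 1) / (k1 * (1 - b) + k1 * b * doc_len / mean_len + f_dw)


def okapi_score(query_tf, doc_tf, doc_len, mean_len, n_docs, doc_freq, k1=1.5, b=0.4):
    """Formula (1): sum over the words shared by query and document."""
    score = 0.0
    for w, f_qw in query_tf.items():
        if w in doc_tf:
            score += f_qw * c_d(doc_tf[w], doc_len, mean_len, k1, b) * idf(n_docs, doc_freq[w])
    return score
```

Note that for a document of average length and a term occurring once, `c_d` evaluates to 1, so the score of a one-word query reduces to the idf of that word.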



3.2 Statistical Model 

A statistical retrieval model was developed based on previous work on statistical 
language modeling [2].

The match between a query q and a document d can be expressed through
the following conditional probability distribution:

$$P(d \mid q) = \frac{P(q \mid d)\, P(d)}{P(q)} \qquad (4)$$

where $P(q \mid d)$ represents the likelihood of q, given d, $P(d)$ represents the
a-priori probability of d, and $P(q)$ is a normalization term. By assuming no a-priori
knowledge about the documents, and disregarding the normalization factor,
documents can be ranked, with respect to q, just by the likelihood term $P(q \mid d)$.
If we interpret the likelihood function as the probability of d generating q and
assume an order-free multinomial model, the following log-probability score can
be derived:

$$\log P(q \mid d) = \sum_{w \in q} f_q(w) \log P(w \mid d) \qquad (5)$$

Table 2. Notation used in the information retrieval models.

$f_d(w)$       frequency of word w in document d
$f_q(w)$       frequency of w in query q
$f(w)$         frequency of w in the collection
$f_d$          length of document d
$f$            length of the collection
$\bar{l}$      mean document length
$N$            number of documents
$N_w$          number of documents containing w
$V_d$          vocabulary size of document d
$\bar{V}_d$    average document vocabulary size
$V$            vocabulary size of the collection

The probability that a term w is generated by d can be estimated by applying 
statistical language modeling techniques. Previous work on statistical
information retrieval proposed to interpolate relative frequencies of each document
with those of the whole collection, with interpolation weights empirically esti- 
mated from the data. 

In this work we use an interpolation formula which applies the smoothing 
method proposed by nn. This method linearly smoothes word frequencies of 
a document and the amount of probability assigned to never observed terms is 
proportional to the number of different words contained in the document. Hence, 
the following probability estimate is applied: 



$$P(w \mid d) = \frac{f_d(w)}{f_d + V_d} + \frac{V_d}{f_d + V_d}\, P(w) \qquad (6)$$

where $P(w)$, the word probability over the collection, is estimated by
interpolating the smoothed relative frequency with the uniform distribution over the
vocabulary V:

$$P(w) = \frac{f(w)}{f + V} + \frac{V}{f + V} \cdot \frac{1}{V} \qquad (7)$$
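As an illustration, the score (5) with the estimates (6) and (7) can be sketched in Python as follows. The dictionary-based data layout and the function names are assumptions made for the example; documents and the collection are represented simply as word-frequency maps.

```python
import math


def p_collection(f_w, f, V):
    """Formula (7): smoothed collection frequency interpolated with the
    uniform distribution over a vocabulary of size V."""
    return f_w / (f + V) + (V / (f + V)) * (1.0 / V)


def p_doc(f_dw, f_d, V_d, p_w):
    """Formula (6): the mass reserved for unseen words grows with the
    number of different words V_d in the document."""
    return f_dw / (f_d + V_d) + (V_d / (f_d + V_d)) * p_w


def log_likelihood(query_tf, doc_tf, coll_tf, f, V):
    """Formula (5): log P(q | d) under an order-free multinomial model."""
    f_d = sum(doc_tf.values())  # length of document d
    V_d = len(doc_tf)           # vocabulary size of document d
    total = 0.0
    for w, f_qw in query_tf.items():
        p_w = p_collection(coll_tf.get(w, 0), f, V)
        total += f_qw * math.log(p_doc(doc_tf.get(w, 0), f_d, V_d, p_w))
    return total
```

Because both estimates are proper distributions over the vocabulary, a query word that never occurs in a document still receives a non-zero probability, so the log score stays finite.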



3.3 Blind Relevance Feedback 

Blind relevance feedback (BRF) is a well-known technique for improving
retrieval performance. The basic idea is to perform retrieval in two steps. First,
the documents matching the original query q are ranked, then the B best ranked
documents are taken and the T most relevant terms in them are added to the
query. Then, the retrieval phase is repeated with the augmented query. In this
work, new search terms are extracted by sorting all the terms of the B top
documents according to:

$$\frac{(r_w + 0.5)(N - N_w - B + r_w + 0.5)}{(N_w - r_w + 0.5)(B - r_w + 0.5)} \qquad (8)$$

where $r_w$ is the frequency of word w inside the B top documents.
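A minimal Python sketch of the term selection step is given below. It ranks candidate terms by the ratio printed above, and it assumes that the frequency of w inside the B top documents counts the number of those documents containing w, which keeps the denominator positive. The function names and the data layout (each top document as a list of terms) are hypothetical.

```python
from collections import Counter


def selection_value(r_w, n_w, n_docs, b_docs):
    """Ratio used to sort candidate expansion terms.
    r_w: top documents containing w (assumed document count);
    n_w: documents in the collection containing w."""
    numerator = (r_w + 0.5) * (n_docs - n_w - b_docs + r_w + 0.5)
    denominator = (n_w - r_w + 0.5) * (b_docs - r_w + 0.5)
    return numerator / denominator


def expansion_terms(top_docs, doc_freq, n_docs, t_terms):
    """Return the T best-ranked terms found in the B top documents."""
    b_docs = len(top_docs)
    r = Counter(w for doc in top_docs for w in set(doc))
    ranked = sorted(
        r, key=lambda w: (-selection_value(r[w], doc_freq[w], n_docs, b_docs), w)
    )
    return ranked[:t_terms]
```

Terms that are frequent in the top documents but rare in the collection are ranked first, while very common words are pushed to the bottom of the list.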



4 Experiments 



This section presents work done to develop and test the presented models.
Development and testing were done on two different Italian document retrieval tasks.
Performance was measured in terms of Average Precision (AvPr) and mean
Average Precision (mAvPr). Given the document ranking provided against a given
query q, let $r_1 < \ldots < r_k$ be the ranks of the retrieved relevant documents. The
AvPr for q is defined as the average of the precision values achieved at all recall
points, i.e.:

$$\mathrm{AvPr} = 100 \times \frac{1}{k} \sum_{i=1}^{k} \frac{i}{r_i} \qquad (9)$$



The mAvPr of a set of queries corresponds to the mean of the corresponding 
query AvPr values. 
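The two measures can be computed with a few lines of Python. This is a sketch of the definition as stated in the text, where the average is taken over the retrieved relevant documents; the function names are illustrative.

```python
def average_precision(relevant_ranks):
    """AvPr for one query: the i-th retrieved relevant document, found at
    rank r_i, contributes a precision of i / r_i."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    return 100.0 * sum(i / r for i, r in enumerate(ranks, start=1)) / len(ranks)


def mean_average_precision(ranks_per_query):
    """mAvPr: mean of the per-query AvPr values."""
    return sum(average_precision(r) for r in ranks_per_query) / len(ranks_per_query)
```

For instance, relevant documents retrieved at ranks 2 and 4 give precisions 1/2 and 2/4, hence an AvPr of 50.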



Table 3. Development and test collection sizes.

Data Set                    # docs    Avg. # words/doc
CLIR - Swiss News Agency    62,359    225
CLEF - La Stampa            58,051    552



4.1 Development 

For the purpose of parameter tuning, development material made available by 
CLEF was used. The collection consists of the test set used by the 1999 TREC-8 
CLIR track and its relevance assessments. The CLIR collection contains topics 
and documents in four languages: English, German, French, and Italian. The 
Italian part consists of texts issued by the Swiss News Agency (Schweizerische
Depeschenagentur) from 17-11-1989 until 12-31-1990, and 28 topics, four of which
have no corresponding Italian relevant document^. More details about the
development collection are provided in Tables 3, 4, and 5.



^ CLIR topics without Italian relevant documents are 60, 63, 76, and 80. 
Table 4. Topic statistics of development and test collections. For development
and evaluation, queries were generated by using all the available topic fields.

                                 # of Words
Data Set (topic #'s)    Min    Max    Avg.    Total
CLIR (54-81)            41     107    70.4    1690
  title                 3      8      5.1     122
  description           8      27     17.1    410
  narrative             25     81     48.3    1158
CLEF (1-40)             31     96     60.8    2067
  title                 3      9      5.3     179
  description           7      35     15.7    532
  narrative             14     84     39.9    1356


Table 5. Document retrieval statistics of development and test collections.

                              # of Relevant Docs
Data Set (topic #'s)    Min    Max    Avg.    Total
CLIR (54-81)            2      15     7A      TW
CLEF (1-40)             1      42     9.9     338



4.2 Okapi Tuning 

Tuning of the parameters in formula (2) was carried out on the development data.
Queries were generated by using all the available topic fields. In Figure 1 a plot
of the mAvPr versus different values of the parameters $k_1$ and $b$ is shown. Finally,
the values $k_1 = 1.5$ and $b = 0.4$ were chosen, because they provided consistently
good results also with other evaluation measures. The achieved mAvPr is 46.1%.

4.3 Blind Relevance Feedback Tuning 

Tuning of BRF parameters B and T was carried out just for the Okapi model. 
In Figure 2 a plot of the mAvPr versus different values of the parameters is
shown. Finally, the number of relevant documents B = 5 and the number of 
relevant terms T = 15 were chosen, whose combination gives a mAvPr of 49.2%, 
corresponding to a 6.8% improvement over the first step. 

Further work was done to optimize the performance of the first retrieval 
step. Indeed, performance of the BRF procedure is determined by the precision 
achieved, by the first retrieval phase, on the very top ranking documents. In
particular, a higher resolution for documents and queries was considered by using
base forms instead of stems. In Table 6 mAvPr values are shown by considering
different combinations of text preprocessing before and after BRF. In particular, 
we considered using base forms before and after BRF, using word stems before 
and after BRF, and using base forms before BRF and stems after BRF. The last 
combination achieved the largest improvement (8.6%) and was adopted for the 
final system. 



ITC-irst at CLEF 2000: Italian Monolingual Track 



Fig. 1. Mean Average Precision versus different settings of Okapi formula's
parameters $k_1$ and $b$.

Fig. 2. Mean Average Precision versus different settings of blind relevance
feedback parameters B and T.
Table 6. Mean Average Precision by using base forms (ba) or word stems (st)
before (I) and after (II) blind relevance feedback (with B=5).

                     # of relevant terms T
I     II     5       10      15      20      25      30
st    st     46.4    47.3    49.2    49.6    48.3    48.5
ba    ba     46.2    47.6    47.6    47.6    47.7    47.3
ba    st     46.7    48.7    50.0    48.5    48.6    48.6



4.4 Official Evaluation 



The two presented models were evaluated on the CLEF 2000 Italian monolin- 
gual track. The test collection consists of newspaper articles published by La 
Stampa, during 1994, and 40 topics. As six of the topics do not have corresponding
documents in the collection they are not taken into account.^ Also for the
evaluation, all the available topic fields were used to generate the queries. More
details about the CLEF collection and topics are in Tables 3, 4, and 5.



Fig. 3. Difference (in average precision) from the median for each of the 34 topics
in the CLEF 2000 Italian monolingual track. Moreover, the best AvPr reference
is plotted for each topic.



^ CLEF topics without Italian relevant documents are 3, 6, 14, 27, 28, and 40. 
Official results of the Okapi and statistical models are reported in Figure 3
with the names irst1 and irst2, respectively. Figure 3 shows the difference in
AvPr between each run and the median reference provided by the CLEF
organization. As a further reference, performance differences between the best result
of CLEF and the median are also plotted. The mAvPr of irst1 and irst2 are
49.0% and 47.5%, respectively. Both methods score above the median reference
mAvPr, which is 44.5%. The mAvPr of the median reference was computed by
taking the average over the median AvPr scores.



5 Improvements 

By looking at Figure 3 it emerges that the Okapi and the statistical model
have quite different behaviors. This would suggest that if the two methods rank 
documents independently, some information about the relevant documents could 
be gained by integrating the scores of both methods. 

In order to compare the rankings of two models A and B, Spearman's
rank correlation can be applied. Given a query, let r(A(d)) and r(B(d)) represent
the ranks of document d given by A and B, respectively. Hence, Spearman's rank
correlation is defined as:

$$S = 1 - \frac{6 \sum_{d} [r(A(d)) - r(B(d))]^2}{N(N^2 - 1)} \qquad (10)$$

Under the hypothesis of independence between A and B, S has mean 0 and
variance 1/(N - 1). On the contrary, in case of perfect correlation the S statistic
has value 1.

By taking the average of S over all the queries^, a rank correlation of 0.4
resulted between irst1 and irst2.
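The statistic can be computed directly from two rank assignments, as in the following Python sketch. It uses the standard Spearman formula without ties; the dictionary layout (document identifier mapped to rank) and the function name are assumptions made for the example.

```python
def spearman_rank_correlation(rank_a, rank_b):
    """Spearman's S for two rankings of the same N documents.
    rank_a and rank_b map each document id to its rank (1..N, no ties)."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same N >= 2 documents")
    squared_diffs = sum((rank_a[d] - rank_b[d]) ** 2 for d in rank_a)
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))
```

Identical rankings give S = 1 and exactly reversed rankings give S = -1; averaging the per-query values yields the overall figure reported in the text.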

This result confirms some degree of independence between the two information
retrieval models. Hence, a combination of the two models was implemented
by just taking the sum of scores. Actually, in order to adjust scale differences, 
scores of each model were normalized in the range [0, 1] before summation. By 
using the official relevance assessments of CLEF, a mAvPr of 50.0% was achieved 
by the combined model. 
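The combination step can be sketched in Python as follows. The text only states that scores were normalized in the range [0, 1] before summation; the min-max normalization used here, and the choice of giving a document missing from one run a contribution of 0 for that run, are assumptions of this example.

```python
def normalize(scores):
    """Min-max normalization of one model's scores into [0, 1]
    (assumed normalization method)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(scores_a, scores_b):
    """Rank documents by the sum of the normalized scores of two models."""
    norm_a, norm_b = normalize(scores_a), normalize(scores_b)
    merged = {
        doc: norm_a.get(doc, 0.0) + norm_b.get(doc, 0.0)
        for doc in set(norm_a) | set(norm_b)
    }
    return sorted(merged, key=lambda doc: (-merged[doc], doc))
```

Normalizing first matters here because one model produces sums of term weights while the other produces log-probabilities, so the raw scores live on very different scales.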

In Figure 4 and Figure 5 detailed results of the combined model (merge) are
provided for each query, respectively, against the CLEF references and irst1
and irst2. The combined model performs better than the median
reference on 24 topics of 34, while irst1 and irst2 improved the median AvPr 16
and 17 times, respectively. Finally, the combined model improves the best reference
on two topics (20 and 36).

^ As an approximation, rankings were computed for the union of the 100 top docu- 
ments retrieved by each model. 
Table 7. Performance of retrieval models on the CLEF 2000 Italian monolingual
track.

Retrieval Model      Official Run    mAvPr
Okapi                irst1           49.0
Statistical model    irst2           47.5
Combined model       -               50.0



6 Conclusion 

This paper presents preliminary research results by ITC-irst in the field of text
retrieval. Nevertheless, participation in the CLEF evaluation has been
considered important in order to gain experience and feedback about our progress.
Future work will be done to improve the statistical retrieval model, develop a 
statistical blind relevance feedback method, and develop a statistical model for 
cross-language retrieval. 
Table 8. Example of text preprocessing. The flag in the last column (R) indicates
whether the term survives stop-terms removal. The two POSs marked
with # are wrong; nevertheless, they still yield correct base forms and
stems.

Text          POS     Base form     Stem         R
IL            RS      IL            IL           0
PRIMO         AS      PRIMO         PRIM         1
MINISTRO      SS      MINISTRO      MINISTR      1
LITUANO       AS      LITUANO       LITUAN       1
,             XPW     ,             ,            0
SIGNORA       SS      SIGNORA       SIGNOR       1
KAZIMIERA     SPN     KAZIMIERA     KAZIMIER     1
PRUNSKIENE    SPN     PRUNSKIENE    PRUNSKIEN    1
,             XPW     ,             ,            0
HA            #VI#    AVERE         AVERE        0
ANCORA        B       ANCORA        ANCORA       0
UNA           RS      UNA           UNA          0
VOLTA         SS      VOLTA         VOLT         1
SOLLECITATO   VSP     SOLLECITARE   SOLLECIT     1
OGGI          B       OGGI          OGGI         0
UN            RS      UN            UN           0
RAPIDO        #SS#    RAPIDO        RAPID        1
AVVIO         SS      AVVIO         AVVIO        1
DEI           EP      DEI           DEI          0
NEGOZIATI     SP      NEGOZIATO     NEG          1
CON           E       CON           CON          0
L’            RS      L’            L’           0
URSS          YA      URSS          URSS         1
,             XPW     ,             ,            0
RITENENDO     VG      RITENERE      RITEN        0
FAVOREVOLE    AS      FAVOREVOLE    FAVOR        1
L’            RS      L’            L’           0
ATTUALE       AS      ATTUALE       ATTUAL       1
SITUAZIONE    SS      SITUAZIONE    SIT          1
NEI           EP      NEI           NEI          0
RAPPORTI      SP      RAPPORTO      RAPPORT      1
FRA           E       FRA           FRA          0
MOSCA         SPN     MOSCA         MOSC         1
E             C       E             E            0
VILNIUS       SPN     VILNIUS       VILNIUS      1
Abstract. We employ Automorphology, an MDL-based algorithm that
determines the suffixes present in a language-sample with no prior knowledge 
of the language in question, and describe our experiments on the usefulness of 
this approach for Information Retrieval, employing this stemmer in a SMART- 
based IR engine. 



1 Introduction 

The research discussed in this volume is directed at the special character of 
Information Retrieval in the multilingual world which is the future of the information 
age. What special challenges must we be ready for as we prepare our document bases 
and document spaces for texts in a potentially unlimited number of languages? What
additional technology must we develop in preparation for those challenges?

To the extent that current IR methods make assumptions about language which are 
valid for English but not for many other natural languages, these methods will need to 
be updated in the light of what we know about natural languages more generally. Our 
concern in the work reported here is the need for stemming (and related processes) 
that is fast, accurate, valid for as many languages as possible, and that assumes no 
human intervention in the process. 

We are currently in the process of developing software that accepts unrestricted 
corpora as input and produces, as its output, a list of stems and affixes found in the 
corpus, plus additional information about cooccurrence of affix and stem. It does this 
on the basis of no prior knowledge of the language found in the corpus. When linked 
to an automatic language identification system, such a system is able to add to our 
ability to control a large document base which must accept documents in any 
language — such as the Internet, for example. Although the testing done in the context 
of the CLEF experiments deals with some of the larger European languages, we see 
our approach as being most useful when it is used in relation to a database that 
includes a large number of documents from little-studied languages, because 
morphologies cannot be produced overnight by humans. 



' We are grateful for help and comments from Abraham Bookstein and Craig Swietlik. This 
work was supported in part by a grant from the University of Chicago-Argonne National 
Laboratory. 
Our background is in linguistics and computational linguistics, rather than
information retrieval (IR), but in the next section we will survey what we take to be 
the relevant background information regarding the character of stemming for IR in 
English and other languages. 



2 Multilingual Stemming 

The use of stemming in information retrieval systems is widespread, though not
entirely uncontroversial. It is used primarily for query stemming and document
indexing. (Useful reviews may be found in Q, ||^, 

Stemming in its narrowest sense is "a process that strips off affixes and leaves you
with a stem". A broader procedure is conflation: "a computational procedure
which identifies word variants and reduces them to a single canonical form".
Word variants are usually morphological or semantic. Stemming in
the narrow sense is a type of conflation procedure. Very commonly, though, the term
is used not just in that narrow sense, but to refer to lemmatization or
collapsing. "Stemming" in query expansion refers to that second sense. For our
purposes, stemming is taken in a broad, but not the broadest, sense. Any algorithm
that results in segmenting a word into stem and affixes is a stemming algorithm, or
stemmer.

Significant factors for stemming performance in IR include the type of stemming
algorithm, evaluation measures of retrieval success, language-(in)dependence, query
length, document length, and possibly others. These issues have been addressed
in many studies, but no clear comprehensive picture emerges from the literature.

By its very nature, stemming is generally understood to improve recall, but to
decrease precision. Most research on stemming in IR is on English, a
language with a relatively simple morphology. In a study comparing three different
stemmers of English, Harman found that losses in precision from stemming
outweigh the benefits from increased recall. Krovetz reported results conflicting
with what Harman found for the Porter algorithm on the same collection using a very
close evaluation measure o. and in general the view that overall stemming is 
beneficial for IR is discussed in Q, and 



2.1 Types of Stemmers and Evaluation Measures 

Stemmers may be linguistic, automatic or mixed. Linguistic stemmers use a linguist's
knowledge of the structure of the language in one way or another, typically by
providing manually compiled lists of suffixes, allomorphy rules, and the like. The best
known stemmer of this sort is Porter, initially developed for English. Porter's
approach was extended to French and Italian and to Dutch. Automatic
stemmers rely on statistical procedures, such as frequency count, n-gram method, or
some combination of these. Linguistic stemmers that rely on statistical methods as
subsidiary procedures may be called mixed. Such mixed systems include those of
Krovetz and Paice. Krovetz uses frequency of English derivational endings as the basis for
incorporating them into the stemmer, and the initial shared trigram as a preliminary
procedure for finding words that are potentially morphologically related. Paice
requires the words in a manually compiled semantic identity class to share the initial
bigram.
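
The shared-initial-trigram heuristic mentioned above can be illustrated with a minimal sketch. It only groups words by their first three letters to propose candidate classes of potentially related words; it is not Krovetz's or Paice's actual procedure, and the function name and sample words are hypothetical.

```python
from collections import defaultdict


def group_by_initial_ngram(words, n=3):
    """Group words that share their first n letters (n=3: initial trigram)."""
    groups = defaultdict(set)
    for word in words:
        word = word.lower()
        if len(word) >= n:  # words shorter than n letters are left ungrouped
            groups[word[:n]].add(word)
    # Keep only prefixes shared by at least two distinct words.
    return {prefix: sorted(ws) for prefix, ws in groups.items() if len(ws) > 1}


# Hypothetical sample vocabulary.
candidates = group_by_initial_ngram(
    ["connect", "connected", "connection", "contain", "fail", "failure"]
)
```

Note that contain lands in the same candidate class as connect, which is why such a step can only be preliminary: a later procedure has to decide which candidates are actually related.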

It has been pointed out in the literature that it is difficult to evaluate and compare
the performance of different stemming algorithms for IR purposes because the
traditional IR evaluation measures are not aimed at highlighting the contribution of
stemming to query success. Several studies that compare the
effectiveness of different stemming algorithms for IR were
conducted on English materials, with Paice and Hull developing new
measures for evaluating stemming performance for IR. The results are inconclusive.

Lennon et al. evaluated seven stemming algorithms for English for their
usefulness in IR. The automatic algorithms in this study were RADCOL,
Hafer-Weiss, a similarity stemmer developed by the authors on the basis of
Adamson and Boreham's bigram stemmer, and a frequency algorithm developed
by the authors on the basis of RADCOL. The linguistic stemmers were Lovins and
Porter. The Hafer-Weiss algorithm fared much worse than all others. With this
exception, they found an undeniable, but very slight improvement on stemmed
queries compared to unstemmed ones. They also found "no relationship between the
strength of an algorithm and the consequent retrieval effectiveness arising from its
use".

Harman tested three linguistic stemmers for IR effectiveness: Porter, the
SMART-enhanced Lovins stemmer, and the primitive s-stripping stemmer. She found that
the minimal s-stemming did very little to improve IR effectiveness, and that richer
stemming hurts precision as much as it improves recall.

Hull evaluated five linguistic stemmers for English: an s-remover, an extensively
modified Lovins stemmer, the Porter stemmer, the Xerox English inflectional analyzer and
the Xerox English derivational analyzer. He proposed a set of alternative evaluation
measures aimed to distinguish performance details of various stemmers. In his
analysis, stemming is much more helpful on short queries, on which the inflectional
stemmer looks slightly less effective, and the Porter stemmer slightly better, than the
others; the simple plural removal is less effective than more complex stemmers, but
quite competitive when only a small number of documents is examined. His detailed
analysis of queries shows how linguistic knowledge may be beneficial for IR in some
cases (failure/fail: only the derivational stemmer makes this connection) but not in
others (optics/optic: the derivational and inflectional stemmers do not make this
connection).

Paice developed a direct measure of evaluating accuracy of a stemmer "by
counting the actual understemming and overstemming errors which it commits". He
evaluated three stemmers for the English language: Porter, Lovins and Paice/Husk.
It was found that his measure provides a good representation of stemmer weight,
but no clear comparison of accuracy for stemmers differing greatly in weight. There is
no clear relationship between IR measures and Paice's evaluation.

The upshot appears to be that for English, the choice of stemmer type ultimately
does not matter much, though not all studies agree. Krovetz found that his inflectional
stemmer always helped a little, but the important improvement came from his
derivational stemmer. Lennon et al. and Hull found no overall consistent
differences between stemming algorithms of various types: on a particular
query one algorithm might outperform another, but never consistently. Most studies note
that stemming performance varies on different collections. Paice notes that heavy
stemmers might be preferable in situations where high recall is needed, and lighter
stemmers where precision is more important.

For languages with morphology richer than that of English, differences between 
inflectional and derivational morphology — and, consequently, between performance 
of stemmers oriented towards one or the other — should be greater. Stripping off 
inflectional morphology should result in more than slight recall improvement without 
significantly hurting precision. In Russian, for example, the nominal declension has 
two numbers and six cases (declension paradigms are determined by the gender of the 
noun and the phonological form of the stem). Dictionary entries are listed in the 
nominative singular, and one would expect most queries to be entered in the 
"dictionary form", the nominative singular. However, actual occurrences of the word
appearing in the texts could be more frequent in oblique cases and in the plural. For 
example, a search for the nominative singular of the word ruka 'hand' in Leo Tolstoy's 
Anna Karenina (over 345,000 words) would locate 18 occurrences of the exact match. 
The stem ruk, on the other hand, appears 690 times, in forms inflected for case and
number. Most frequent forms are ruk-u (accusative singular) and ruk-i (genitive
singular, nominative plural). Nozhov reports that all Russian IR systems
routinely use stemming (linguistic or mixed) even when the degree of morphological
recognition is not extremely high.



Kraaij and Pohlmann compared the Porter-style algorithm they implemented 
for Dutch, another morphologically complex language, with their more linguistically 
sophisticated derivational and inflectional stemmers. The best performance was 
achieved by the inflectional stemming combined with a sophisticated version of 
compound splitting and generating. Applying both derivational and inflectional 
stemming generally reduces precision too much.

Wexler et al. developed a four-language search engine (French, Italian,
German and English) with stemming implemented for each language. For German, a 
language morphologically close to Dutch, they apparently implemented some 
inflectional stemming and a dictionary-based compound-breaking algorithm. 

A derivational stemmer could produce a theoretically irreproachable result which is 
not just irrelevant, but harmful for IR purposes, since the stem and its derivates are 
rarely fully synonymous. The problem is to distinguish derivation that preserves word 
sense relevant to the query from the derivation that does not. Hull's study gives 
examples of the derivational stemmer outperforming others on queries like bank 
failures (failure converted to fail), and superconductivity (stem superconduct
conflated with the one in superconductors). Since the relevant documents contained 
both failure and fail, and superconductors rather than superconductivity, the 
stemming was beneficial. However, in cases like client-server architecture (conflate 
with serve) and Productivity Statistics for the U.S. Economy (conflate with produce) 
the linguistically correct analysis lowers precision dramatically, since serve and 
produce have a much less specific meaning than the query term. The lexical
equivalence requirement may be maintained through manually compiled lists
(for English), or by word sense disambiguation in a full-blown NLP system
(for French).

2.2 Automatic Stemmer on More than One Language

The increasingly multi-language character of IR presents a special challenge to
language-specific tools. Statistical language processing tools, with their universality 
and speed, are understandably attractive in this regard. Whether stemming based on 
such universal methods helps to increase accuracy and scope of IR is a question 
without a definitive answer yet. 

Xu and Croft tested the performance of an automatic trigram stemmer, a
"general-purpose language tool", against the performance of the Porter stemmer and
KStem on English and Spanish corpora for construction of "initial equivalence
classes". The initial equivalence classes were further refined with statistical methods
that differed for English and Spanish. The "trigram approach" was used as an
auxiliary procedure to clean up the equivalence classes for English after the
application of the connected component algorithm: a "prefix" in an equivalence class
is defined as "an initial character string shared by more than 100 words". Examples are
con, com and inter. If the next 3 characters after the common prefix do not match, the
similarity metric is set to 0. Thus, the trigram model is at work again, shifted further
inside the string. The results were comparable with the performance of the linguistic
Porter and KStem stemmers, showing some portability problems due to the
corpus-specific character of equivalence classes.



2.3 Compounds 

As virtually all studies on IR in German have documented (and as reported in this
year's CLEF results by the West Group), it is crucial to analyze
compound words in German, and no doubt in other languages with similar use of
compound structures. Use of automatic morphology can be of significant help in this
area, as reported in connection with Automorphology. Because our algorithm
identifies stems, it is possible to identify compounds, which take the form Stem- 
Linker-Stem-Suffix; that is, the first half of the compound need not be a free-standing 
word. 
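
The Stem-Linker-Stem-Suffix pattern can be sketched as a simple search over split points. This is only an illustration under stated assumptions: the stem set would come from the automatic analysis, while the linker and suffix lists below are hypothetical German-like examples, and the function is not the Automorphology implementation.

```python
def split_compound(word, stems, linkers=("", "s", "n", "en"),
                   suffixes=("", "e", "en", "er")):
    """Return (stem1, linker, stem2, suffix) for the first analysis found, else None."""
    word = word.lower()
    # Require at least three letters on each side of the split point.
    for i in range(3, len(word) - 2):
        first = word[:i]
        if first not in stems:
            continue
        rest = word[i:]
        for linker in linkers:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            for suffix in suffixes:
                if not tail.endswith(suffix):
                    continue
                second = tail[: len(tail) - len(suffix)]
                if second in stems:
                    return first, linker, second, suffix
    return None


# Hypothetical stem inventory, as might be produced by a stemmer.
example = split_compound("Arbeitsmarkt", {"arbeit", "markt", "platz"})
```

Because only stems are looked up, the first element does not have to occur as a free-standing word in the corpus.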



3 Automatic Morphology 

The identification of a lexical stem consists of the identification of a string of letters 
which co-occurs in a large corpus with several distinct suffixes, and typically we will 
find consistent sets of suffixes that appear with a wide range of stems. This 
observation serves as one of the bases for our algorithm, whose goal is to establish as 
wide a range of stems and suffix possibilities as possible, given a corpus from a 
natural language. The following discussion is a summary of material presented
elsewhere. Its goal is to establish a method which is language-independent, to the extent
possible, and which will provide a useful result despite the lack of any human 
oversight by a speaker of the language in question. 

There are several methods that can be used to establish an initial set of candidate 
suffixes on a statistical basis, given a sample of an unknown language. One of the
simplest is to consider all word-final sequences of six or fewer letters (schaft is a
German suffix), and to rank their coherence in the text on the basis of the formula in 
(1). In order to deal appropriately with single-letter suffixes, it is preferable to 
consider all words to end with a special symbol, and to increase the maximum size to 
seven letters. The frequency of a letter is defined as the number of occurrences of the 
letter in the text divided by the total number of letters in the text. 



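$$\mathrm{freq}(l_1 l_2 \ldots l_n)\,\log\frac{\mathrm{freq}(l_1 l_2 \ldots l_n)}{\mathrm{freq}(l_1)\,\mathrm{freq}(l_2)\cdots\mathrm{freq}(l_n)} \tag{1}$$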



We select the top 100 suffixes ranked by coherence (1) (these are our candidate 
suffixes), and divide all words into stem and suffix if they end in one or more 
candidate suffixes. We associate with each such candidate stem the set of suffixes it 
occurs with, and call each such set a candidate signature. We accept only signatures 
with at least two suffixes, and we establish a threshold number of stems which a 
signature must be associated with, failing which a signature is eliminated; a suitable 
threshold is 5. 
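
The candidate-suffix and signature steps can be sketched as follows. This is a minimal illustration, not the Automorphology code: the function names are hypothetical, the frequency of a word-final sequence is assumed to be its count divided by the number of word tokens (the text does not fix this normalization), and the null suffix is not handled.

```python
import math
from collections import Counter, defaultdict

END = "#"  # special end-of-word symbol, so a one-letter suffix has length 2


def candidate_suffixes(tokens, max_len=7, top_k=100):
    """Rank word-final sequences (including END) by the coherence score in (1)."""
    marked = [t.lower() + END for t in tokens]
    letter_counts = Counter(ch for w in marked for ch in w)
    total_letters = sum(letter_counts.values())
    final_counts = Counter()
    for w in marked:
        # Always include END and leave at least one letter for the stem.
        for n in range(2, min(max_len, len(w) - 1) + 1):
            final_counts[w[-n:]] += 1
    scores = {}
    for seq, count in final_counts.items():
        f_seq = count / len(marked)  # assumption: normalize by token count
        f_letters = math.prod(letter_counts[ch] / total_letters for ch in seq)
        scores[seq] = f_seq * math.log(f_seq / f_letters)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [seq[:-1] for seq in ranked]  # drop the END marker


def signatures(tokens, suffixes, min_stems=5):
    """Map each signature (a set of suffixes) to the set of stems that take it."""
    vocabulary = {t.lower() for t in tokens}
    stem_suffixes = defaultdict(set)
    for word in vocabulary:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_suffixes[word[: -len(suffix)]].add(suffix)
    by_signature = defaultdict(set)
    for stem, sufs in stem_suffixes.items():
        if len(sufs) >= 2:  # accept only signatures with at least two suffixes
            by_signature[frozenset(sufs)].add(stem)
    # Eliminate signatures associated with fewer than min_stems stems.
    return {sig: stems for sig, stems in by_signature.items()
            if len(stems) >= min_stems}
```

With the threshold of 5 suggested above, a signature such as {ed, ing} survives only if at least five distinct stems occur with both suffixes.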

Various improvements can be made to the results at this point. For example, 
common combinations of suffixes are certain to be identified as suffixes (e.g., ments, 
ings in English), but they can be identified and their stems reanalyzed. A large part of 
our work is devoted to determining in an abstract way what kinds of errors our 
algorithms are likely to create, to determine what they are, and to find ways either to 
avoid the errors or to undo them after the fact, but always without human 
intervention. Our current system is heavily based on a Minimum Description Length
analysis, one consequence of which is that if a language has an unusually high
frequency of occurrence of a specific letter in stem-final position, it is likely to be
misanalyzed as being part of a suffix; this is the case for t in English. When viewed
close up, suffix systems tend to have certain kinds of orthographic structures which
derive from their history and which can confuse an automatic analyzer; for example,
Romance languages contain sets of verbal suffixes which are derived historically from
inflected forms of Latin habere, which itself has a stem-suffix structure. The suffixes
-ai, -ais, -ait, etc., of French may in some cases wrongly be analyzed as being -i, -is,
-it, and attached to a stem that ends in -a. We employ the techniques of Minimum
Description Length in order to select the analysis of the complete corpus which is
most compact overall and which provides the most succinct and accurate analysis of
the stem/suffix distribution.

There are two notions at the heart of the MDL approach. The first is that an 
analysis (here, the morphological analysis of a corpus) must provide a probabilistic 
measure of the data; this allows us to assign an optimal compressed length to the 
corpus on the basis of that model, for reasons central to information theory. In this 
case, each word of the corpus is identified as belonging to one of a relatively limited 
number of stem groupings defined by the set of suffixes the stem appears with in the 
corpus; this grouping is called a signature, and each signature is associated with an 
empirical probability. Each word in the corpus is also associated with a stem and a 
suffix, and these associations are assigned an empirical probability, conditioned by 
the signature of the word. Each of these three probabilities (signature, stem, suffix) 
for each word is converted to an optimal compression length (which equals the
logarithm of the reciprocal of the probability), and the sum of these optimal
compression lengths is the compressed length assigned to the corpus by the 
morphological model, measured in bits. The shorter that total length, the better the 
morphology models the corpus. 
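
The compressed length described here can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the input format (one signature, stem, suffix triple per word token) and the function name are assumptions, and the probabilities are simple relative frequencies.

```python
import math
from collections import Counter


def compressed_length_bits(analysis):
    """analysis: one (signature, stem, suffix) triple per word token in the corpus."""
    total = len(analysis)
    sig_counts = Counter(sig for sig, _, _ in analysis)
    stem_counts = Counter((sig, stem) for sig, stem, _ in analysis)
    suffix_counts = Counter((sig, suf) for sig, _, suf in analysis)
    bits = 0.0
    for sig, stem, suf in analysis:
        p_sig = sig_counts[sig] / total
        # Stem and suffix probabilities are conditioned on the signature.
        p_stem = stem_counts[(sig, stem)] / sig_counts[sig]
        p_suf = suffix_counts[(sig, suf)] / sig_counts[sig]
        # Optimal code length is log2 of the reciprocal of the probability.
        bits += -math.log2(p_sig) - math.log2(p_stem) - math.log2(p_suf)
    return bits
```

For a toy corpus of four tokens sharing one signature, with two equally frequent stems and two equally frequent suffixes, each token costs 0 + 1 + 1 bits, so the corpus is assigned 8 bits.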

The second notion at the heart of MDL is that length of the model itself can be 
measured in bits, and the optimal analysis of the corpus is that for which the sum of 
the length of the model and the compressed length of the corpus is the smallest. Our 
algorithm searches the space of possible analyses by considering changes to the 
signature set, to the affix set, and to the stem/suffix separation, evaluating and
accepting each change only if the change brings about a decrease in the total 
description length of the (corpus + morphology). 




Fig. 1. The basic design of the Chicago IR system, using Automorphology to stem terms 
from queries and documents, and employing standard SMART vector-based retrieval. 



4 Experiment 

The information retrieval engine we used in our CLEF experiments is based on the 
freely-available SMART system, running under the Linux operating system on a 
commodity, off-the-shelf PC. We modified the system to incorporate our custom 
stemmer, which was automatically derived from the corpora for each language. The 
results of applying our stemmer to the document collections were stored in a file for 
SMART to consult at the time of indexing the documents and queries. A schematic 
diagram of our system architecture is presented in Figure 1. Although not represented 
in Figure 1, statistical compound-breaking using Automorphology was also 
performed on the German collection before indexing the documents and queries. 

The vector-based SMART backbone is a simple retrieval model, treating each 
document as an unordered “bag” (i.e., retaining only frequency information), and 
computing document-query similarity by means of the cosine distance between these 
two vectors. Our expectations regarding results in this experiment were therefore 
guarded. Our hope is that these runs will help to highlight the strengths and
weaknesses of the statistical approach to stemming for IR, and point out directions for
us to progress in our development of Automorphology. 
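
The bag-of-words comparison can be sketched as below. This is a minimal illustration using raw term frequencies, not SMART's own code; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_terms, query_terms):
    """Cosine of the angle between two term-frequency vectors (word order is ignored)."""
    doc, query = Counter(doc_terms), Counter(query_terms)
    dot = sum(doc[t] * query[t] for t in doc.keys() & query.keys())
    norm = (math.sqrt(sum(v * v for v in doc.values()))
            * math.sqrt(sum(v * v for v in query.values())))
    return dot / norm if norm else 0.0
```

Since the vectors keep only frequency information, reordering the terms of a document leaves its score unchanged, which is why stemming matters: two inflected forms contribute to the same component only if they are mapped to the same stem.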



4.1 Generation of the Stop and Stem Lists 

As a stopword list for each language, we created a list of the approximately 300 
highest-frequency words in a corpus of the language, and removed by hand any 
entries that appeared obviously inappropriate. While the resulting stop lists were by 
no means perfect, the lists were not long enough to create a serious problem with 
incorrect stopwords blocking the retrieval of documents which ought to be returned. 
Imperfect stoplists might, however, be blamed for not filtering out as many 
documents as they should, and thereby reducing our system’s precision. Since our 
results do not seem to display a profile of high recall offset by low precision, the 
stoplists do not seem to be an area in which to look for major improvements. 

The stem file for each language, which associates terms with their stem forms for 
indexing (a stem may be identical with the term itself), was produced by running our 
statistical stemming program, Automorphology, on the document collection for each
language. The length of time that this process required varied from three days, for the
Italian document collection, to as much as fourteen days for German, with its higher
mean word length and larger document collection. Improvements in the algorithm
since that work have sped up these times considerably. The stems produced by
Automorphology were accepted without any sort of human revision; the only 
constraint we imposed was that no stem could be shorter than three letters in length. 
While we do not have a concrete analysis of the conflation classes produced by our 
stemmer for each language, it seems likely that some of our performance deficit is due 
to permitting the stemmer to apply so freely. 



4.2 Indexing 

The indexing of documents and queries was done using standard SMART facilities, 
with the inclusion of the stemming routine described above into the process. Terms in 
document and query vectors were weighted according to the tf*idf measure which has 
proven effective in previous IR work. Our group used all of the permissible data 
fields for retrieval in each of our experiments. 
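
A tf*idf weighting can be sketched as follows. The text does not say which SMART weighting variant was used, so the formula here (raw term frequency times the natural logarithm of N/df) is an assumption chosen for simplicity, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

Under this variant a term that occurs in every document receives weight 0, so it cannot contribute to a document-query match.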

Our performance on the CLEF monolingual runs might have been improved if we 
had invested more time in preprocessing the document collections. We did not, for 
example, handle issues related to diacritics at all. Thus, our system would not
conflate French École with Ecole, or German müssen with muessen. However, such
issues were probably not a major factor in determining the system’s retrieval 
accuracy. Another interesting area for future exploration is the relative contribution of 
statistical stemming and statistical compound-breaking in indexing the German 
document collection. Intuitively, decompounding is less likely to do harm, since it 
alters terms which are less likely to be independently searched on anyhow, but it also 
has less potential for improvement of retrieval accuracy, because compounds are 
simply less frequent than non-compounds. 
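
The kind of diacritic handling that was left out could look like the sketch below. It is not part of the system described here; the function name and the choice to transliterate German umlauts as ae, oe, ue are illustrative assumptions.

```python
import unicodedata

# German convention: umlauts are written with a following e, and sharp s as ss.
GERMAN_TRANSLITERATION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def fold_diacritics(term, german=False):
    """Lower-case a term and remove its diacritics."""
    term = unicodedata.normalize("NFC", term.lower())
    if german:
        term = term.translate(GERMAN_TRANSLITERATION)
    # Decompose remaining accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applied at indexing time to both documents and queries, such a step would map École and Ecole to the same term, and likewise müssen and muessen when the German option is set.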

4.3 Retrieval

Once SMART was configured to use this new stemmer, the retrieval process for each 
language was straightforward. SMART uses the vector-space model to retrieve the 
documents most similar to the queries, using the stemmed forms of words as 
components of the vectors. We returned a ranked list of the top 1000 documents 
returned for each query, the maximum number allowed. 



5 Results 

Our system was run in monolingual IR tests in the CLEF project in 2000 involving 
Italian, French, and German. The principal results are presented in Figure 2. 



[Figure: precision averages at interpolated recall levels]

Fig. 2. Precision rates for CLEF experiments on French, German, and Italian



6 Conclusions 

Our work in the area of IR is still in its preliminary stages, and we hesitate to draw 
any conclusions at this time from the quantitative results described here. If our work 
has a long-run contribution to make, it is as a component of a larger IR package, and 
indeed, Oard et al., in this volume, describe experiments employing our automatic
morphological analyzer which in some regards go further than our own pre-conceived 
ideas of its applicability. We are currently engaged in drastically reducing the time 
and storage needs of the algorithm to permit it to be used with databases of the 
magnitude typical of IR tasks, and we will continue to test the value of this work for 
IR tasks.

Appendix A - Run Statistics



This appendix contains the evaluation results for the CLEF 2000 runs. The 
initial pages list each of the runs (identified by the run tags) that were officially 
submitted. Associated with each tag is the organization that produced the run, 
the type of task, the language used for the topic, the type of query (automatic or 
manual), the topic fields used to construct the query, and the run status (used 
for pooling or not). The run list is followed by a description of the evaluation 
measures used for the evaluation. The remainder of the appendix contains the 
evaluation results themselves, in the order given in the run list. 

The appendix is based on material provided to us by NIST (Donna Harman 
and Ellen Voorhees). 



C. Peters (Ed.): CLEF 2000, LNCS 2069, pp. 285-ESl 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001

Characteristics of Submitted Runs

Runtag        Institution          Task   Top. Lang.  Type    Top. Fields  Judged?
apibifra      Johns Hopkins U/APL  bi     F           auto    TDN          Y
apibifrb      Johns Hopkins U/APL  bi     F           auto    TDN          N
apibifrc      Johns Hopkins U/APL  bi     F           auto    TDN          N
apibispa      Johns Hopkins U/APL  bi     Sp          auto    TDN          N
apimofr       Johns Hopkins U/APL  mono   F           auto    TDN          Y
apimoge       Johns Hopkins U/APL  mono   G           auto    TDN          Y
apimoit       Johns Hopkins U/APL  mono   I           auto    TDN          Y
apimua        Johns Hopkins U/APL  multi  E           auto    TDN          Y
apimub        Johns Hopkins U/APL  multi  E           auto    TDN          N
backoff4      U Maryland           multi  E           manual  TDN          Y
backoff4Ling  U Maryland           multi  E           manual  TDN          N
BKGREGA1      UC Berkeley          girt   E           auto    TDN          Y
BKGREGA2      UC Berkeley          girt   E           auto    TDN          Y
BKGREGA3      UC Berkeley          girt   E           auto    TDN          Y
BKGREGA4      UC Berkeley          girt   E           auto    TDN          Y
BKMOFFA2      UC Berkeley          mono   F           auto    TDN          Y
BKMOGGA1      UC Berkeley          mono   G           auto    TDN          Y
BKMOGGM1      UC Berkeley          mono   G           manual  TDN          Y
BKMOIIA3      UC Berkeley          mono   I           auto    TDN          Y
BKMUEAA1      UC Berkeley          multi  E           auto    TDN          Y
BKMUEAA2      UC Berkeley          multi  E           auto    TDN          N
BKMUGAA2      UC Berkeley          multi  G           auto    TDN          N
BKMUGAM1      UC Berkeley          multi  G           manual  TDN          N
BLBabel       U Dortmund           bi     G           auto    TDN          Y
BLLeo         U Dortmund           bi     G           auto    TDN          N
CWI0000       CWI                  mono   G           auto    TD           Y
CWI0001       CWI                  mono   I           auto    TD           Y
CWI0002       CWI                  mono   F           auto    TD           Y
CWI0003       CWI                  bi     D           auto    TD           Y
CWI0004       CWI                  multi  D           auto    TD           Y
EITCLEFFF     Eurospider           mono   F           auto    TDN          Y
EITCLEFGG     Eurospider           mono   G           auto    TDN          Y
EITCLEFII     Eurospider           mono   I           auto    TDN          Y
EITCLEFM1     Eurospider           multi  G           auto    TDN          N
EITCLEFM2     Eurospider           multi  G           auto    TDN          N
EITCLEFM3     Eurospider           multi  G           auto    TDN          Y
finstr        U Tampere            bi     Fi          auto    TD           N
FrenchUCWLP   U Chicago            mono   F           auto    TDN          Y
GermanUCWLP   U Chicago            mono   G           auto    TDN          Y
gerstr        U Tampere            bi     G           auto    TD           N
geruns        U Tampere            bi     G           auto    TD           N
GIRTBabel     U Dortmund           girt   E           auto    TDN          Y
GIRTML        U Dortmund           girt   G           auto    TDN          Y
glalong       U Glasgow            multi  E           auto    TDN          N
glatitle      U Glasgow            multi  E           auto    T            Y
iaiphsrun     IAI                  multi  E/Sw        auto    T            Y
irit1bfr2en   Irit                 bi     F           auto    TDN          Y
irit1men2a    Irit                 multi  E           auto    TDN          Y
irit2bfr2en   Irit                 bi     F           auto    TDN          N
irit2men2a    Irit                 multi  E           auto    TDN          N
iritmonofr    Irit                 mono   F           auto    TDN          Y
iritmonoge    Irit                 mono   G           auto    TDN          Y
iritmonoit    Irit                 mono   I           auto    TDN          Y
irst1         ITC-irst             mono   I           auto    TDN          Y
irst2         ITC-irst             mono   I           auto    TDN          Y
ItalianUCWLP  U Chicago            mono   I           auto    TDN          Y
MLgerman      U Dortmund           mono   G           auto    TDN          Y



nmsuK
nmsuS
pruebaO
ralie2allh1
ralie2allh2
ralie2allmix
ralie2allwac
ralif2emixf
ralif2ewacf
ralif2f
ralif2ff
ralig2ewacf
ralig2gf
ralii2ewacf
ralN2if
shefbi
shefes
shefni
sheftri
swestr
SYRD2E
tnoutdd2
tnoutex1
tnoutex2
tnoutex3
tnoutff2
tnoutii2
tnoutne1
tnoutne2
tnoutne3
tnoutne4
tnoutnx1
unstemmed
WESTff
WESTgg1
WESTgg2
XRCEGO
XRCEGIRTO



Explanations:

Task:            multi = multilingual, bi = bilingual, mono = monolingual, girt = GIRT
Topic Language:  E = English, F = French, G = German, I = Italian, D = Dutch, Sp = Spanish,
                 Sw = Swedish, Fi = Finnish
Type:            auto = automatic (no manual intervention), manual = manual intervention
Topic Fields:    T = title, D = description, N = narrative
Judged?:         Y = run was used for pooling, N = run was not used for pooling.
                 The documents in the pool were judged by human assessors.

Evaluation Techniques and Measures



1 Methodology 

The CLEF evaluation uses procedures very similar to those employed in the 
“ad hoc” task of the TREC conferences. Such “ad hoc” topics are similar to 
what a researcher might for example use in a library environment. This implies 
that the input topic has no training material such as relevance judgments to 
aid in the construction of the input query. Systems ran CLEF topics against all 
documents in the languages relevant for the task they were performing (multi- 
lingual, bilingual or monolingual; or GIRT). 



2 Evaluation Measures 



1. Recall 

A measure of the ability of a system to present all relevant items. 



$$\text{recall} = \frac{\text{number of relevant items retrieved}}{\text{number of relevant items in collection}}$$

2. Precision

A measure of the ability of a system to present only relevant items.

$$\text{precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}$$



Precision and recall are set-based measures. That is, they evaluate the quality 
of an unordered set of retrieved documents. To evaluate ranked lists, precision 
can be plotted against recall after each retrieved document as shown in the 
example below. To facilitate computing average performance over a set of topics, 
each with a different number of relevant documents, individual topic precision 
values are interpolated to a set of standard recall levels (0 to 1 in increments of 
.1). The particular rule used to interpolate precision at standard recall level i is 
to use the maximum precision obtained for the topic for any actual recall level 
greater than or equal to i. Note that while precision is not defined at a recall of 
0.0, this interpolation rule does define an interpolated value for recall level 0.0. In 
the example, the actual precision values are plotted with circles (and connected 
by a solid line) and the interpolated precision is shown with the dashed line. 
Example: Assume a document collection has 20 documents, four of which are 
relevant to topic t. Further assume a retrieval system ranks the relevant docu- 
ments first, second, fourth, and fifteenth. The exact recall points are 0.25, 0.5, 
0.75, and 1.0. Using the interpolation rule, the interpolated precision for all stan- 
dard recall levels up to .5 is 1, the interpolated precision for recall levels .6 and 
.7 is .75, and the interpolated precision for recall levels .8 or greater is .27. 
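
The interpolation rule and the worked example can be checked with a short sketch. The function names are hypothetical and this is not the trec_eval implementation; it only follows the rule as stated above.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """ranked_relevance: booleans in rank order.
    Returns (recall, precision) measured after each relevant document."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    return points


def interpolated_precision(points, levels=None):
    """Maximum precision at any actual recall greater than or equal to each level."""
    levels = levels or [i / 10 for i in range(11)]
    result = []
    for level in levels:
        # Small tolerance guards against floating-point rounding of the levels.
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates, default=0.0))
    return result


# The example from the text: 20 documents, relevant ones at ranks 1, 2, 4 and 15.
ranked = [rank in {1, 2, 4, 15} for rank in range(1, 21)]
points = precision_recall_points(ranked, num_relevant=4)
interp = interpolated_precision(points)
```

Rounded to two decimals, interp is 1.0 for recall levels 0.0 to 0.5, 0.75 for levels 0.6 and 0.7, and 0.27 for levels 0.8 to 1.0, matching the figures given in the example.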



3 System Results Description 

The evaluation results are given in the main body of the appendix: one page per 
run. Each page is comprised of a table and two graphs. These are explained in 
the following. 

3.1 The Table 

Figures are generated by trec_eval courtesy of Chris Buckley using the SMART
methodology. The table has two columns. 

1. Statistics 

The right column contains some general statistics about the run: the number 
of documents that were submitted (usually number of topics times 1000), the 
total number of relevant documents for the given task, and the actual number 
of relevant documents retrieved by that run. 

2. Interpolated Recall - Precision Averages Table. 

Figures are also located in the right column, below the general statistics. 
The precision averages at 11 standard recall levels are used to compare the 
performance of different systems and as the input for plotting the recall- 
precision graph (see below). Each recall-precision average is computed by
summing the interpolated precisions at the specified recall cutoff value
(denoted by $\sum P_\lambda$, where $P_\lambda$ is the interpolated precision at recall
level $\lambda$) and then dividing by the number of topics.

$$\frac{\sum_{i=1}^{NUM} P_\lambda}{NUM}, \qquad \lambda = \{0.0, 0.1, 0.2, 0.3, \ldots, 1.0\}$$



— Interpolating recall-precision 

Standard recall levels facilitate averaging and plotting retrieval results. 



3. Average precision over all relevant documents, non-interpolated 

This is a single-value measure that reflects the performance over all rele- 
vant documents. It rewards systems that retrieve relevant documents quickly 
(highly ranked). 

The measure is not an average of the precision at standard recall levels. 
Rather, it is the average of the precision value obtained after each relevant 
document is retrieved. (When a relevant document is not retrieved at all, 
its precision is assumed to be 0.) As an example, consider a query that has 
four relevant documents which are retrieved at ranks 1, 2, 4, and 7. The 
actual precision obtained when each relevant document is retrieved is 1, 1, 
0.75, and 0.57, respectively, the mean of which is 0.83. Thus, the average 
precision over all relevant documents for this query is 0.83. 

The left column additionally gives the average precision for individual queries. 

4. Precision Table 

At the bottom of the right column, “document level averages” are reported. 
- Precision at 9 document cutoff values. The precision computed after a 
given number of documents has been retrieved reflects the actual measured 
system performance as a user might see it. Each document precision average 
is computed by summing the precisions at the specified document cutoff 
value and dividing by the number of topics (40). 
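
A minimal Python sketch of the document level averages follows. The text says only that there are 9 cutoff values; the specific values below are an assumption (they are the cutoffs commonly reported by TREC-style evaluation tools), and the function names are hypothetical.

```python
# Assumed cutoff values; the text states only that there are nine of them.
CUTOFFS = (5, 10, 15, 20, 30, 100, 200, 500, 1000)


def precision_at(relevant_ranks, cutoff):
    """Precision after `cutoff` documents; relevant_ranks are 1-based ranks."""
    hits = sum(1 for rank in relevant_ranks if rank <= cutoff)
    return hits / cutoff


def document_level_averages(topics, cutoffs=CUTOFFS):
    """Sum the precisions at each cutoff and divide by the number of topics.

    topics: one list of relevant-document ranks per topic.
    """
    num_topics = len(topics)
    return {
        cutoff: sum(precision_at(ranks, cutoff) for ranks in topics) / num_topics
        for cutoff in cutoffs
    }


averages = document_level_averages([[1, 2, 4, 7], [3]])
print(averages[5], averages[10])
```

In this sketch the precision is always divided by the cutoff itself, even when a run returns fewer documents than the cutoff.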

5. R-Precision 

R-Precision is the precision after R documents have been retrieved, where 
R is the number of relevant documents for the topic. It de-emphasizes the 
exact ranking of the retrieved relevant documents, which can be particularly 
useful in TREC where there are large numbers of relevant documents. 



The average R-Precision for a run is computed by taking the mean of the 
R-Precisions of the individual topics in the run. For example, assume a run 
consists of two topics, one with 50 relevant documents and another with 10 
relevant documents. If the retrieval system returns 17 relevant documents 
in the top 50 documents for the first topic, and 7 relevant documents in the 
top 10 for the second topic, then the run's R-Precision would be 
$\frac{\frac{17}{50} + \frac{7}{10}}{2}$, or 0.52. 

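
The two-topic example above can be checked with a small Python sketch. The function names and the input format are hypothetical; placing the relevant documents at the very top ranks is only a convenience, since R-Precision ignores their exact order within the top R.

```python
def r_precision(relevant_ranks, num_relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    hits = sum(1 for rank in relevant_ranks if rank <= num_relevant)
    return hits / num_relevant


def average_r_precision(topics):
    """Mean of the per-topic R-Precisions.

    topics: list of (relevant_ranks, num_relevant) pairs, one per topic.
    """
    return sum(r_precision(ranks, r) for ranks, r in topics) / len(topics)


topics = [
    (list(range(1, 18)), 50),  # 17 relevant documents within the top 50
    (list(range(1, 8)), 10),   # 7 relevant documents within the top 10
]
print(round(average_r_precision(topics), 2))  # 0.52
```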
3.2 Graphs 

1. Recall-Precision Graph 

Figure 1 is a sample Recall-Precision Graph. 



[Figure: Run SAMPLE (Precision-Recall Curve); axis label: Recall] 

Fig. 1. Sample Recall-Precision Graph. 



The Recall-Precision Graph is created using the 11 cutoff values from the 
Recall Level Precision Averages. Typically these graphs slope downward from 
left to right, reflecting the notion that as more relevant documents are 
retrieved (recall increases), more nonrelevant documents are retrieved 
(precision decreases). 

This graph is the most commonly used method for comparing systems. The 
plots of different runs can be superimposed on the same graph to determine 
which run is superior. Curves closest to the upper right-hand corner of the 
graph (where recall and precision are maximized) indicate the best 
performance. Comparisons are best made in three different recall ranges: 0 to 
0.2, 0.2 to 0.8, and 0.8 to 1. These ranges characterize high precision, 
middle recall, and high recall performance, respectively. 
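
One possible way to compare two runs over the three recall ranges is sketched below in Python. Summarizing each range by the mean of its interpolated precision values is an assumption made for this illustration, not a procedure given in the text; the function name, the range labels used as dictionary keys, and the sample curve are hypothetical.

```python
# Index ranges into an 11-point curve (recall 0.0, 0.1, ..., 1.0).
# The boundary levels 0.2 and 0.8 are counted in both adjacent ranges,
# following the ranges "0 to 0.2, 0.2 to 0.8, and 0.8 to 1".
RANGES = {
    "high precision": (0, 2),
    "middle recall": (2, 8),
    "high recall": (8, 10),
}


def range_summary(curve):
    """Mean interpolated precision within each recall range (an assumed summary)."""
    return {
        name: sum(curve[low:high + 1]) / (high - low + 1)
        for name, (low, high) in RANGES.items()
    }


sample_curve = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.0]
print(range_summary(sample_curve))
```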
2. Average Precision Histogram. 

Figure 2 is a sample Average Precision Histogram. 



[Figure: Difference from Median in Avg. Precision per Topic] 

Fig. 2. Sample Average Precision Histogram. 



The Average Precision Histogram shows, for each topic, the difference between 
the average precision of a run (see also left column of the statistics table) 
and the median average precision of all corresponding runs on that topic. This 
graph is intended to give insight into the performance of individual systems 
and the types of topics that they handle well. 
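
The values plotted in such a histogram can be sketched in Python as follows. The function name and the dictionary-based input format are hypothetical, and the three runs are made-up numbers used only to show the calculation.

```python
from statistics import median


def difference_from_median(run_ap, all_runs_ap):
    """Per-topic difference between one run's average precision and the median.

    run_ap: dict mapping topic id to the average precision of the run of interest.
    all_runs_ap: list of such dicts, one per corresponding run (run_ap included).
    """
    return {
        topic: ap - median(other[topic] for other in all_runs_ap)
        for topic, ap in run_ap.items()
    }


run_a = {1: 0.5, 2: 0.2}
run_b = {1: 0.3, 2: 0.4}
run_c = {1: 0.1, 2: 0.9}
print(difference_from_median(run_a, [run_a, run_b, run_c]))
```

A positive value means the run did better than the median on that topic, a negative value that it did worse.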



Appendix A - Run Statistics 

[Figures: Precision-Recall Curve panels, one per run. Runs: apibifra, 
apibifrb, apibifrc, apibispa, apimofr, apimoge, apimoit, apimua, apimub, 
backoff4, backoff4Ling, BKGREGA1, BKGREGA2, BKGREGA3, BKGREGA4, BKM0FFA2, 
BKM0GGA1, BKM0GGM1, BKM0IIA3, BKMUEAA1, BKMUEAA2, BKMUGAA2, BKMUGAM1, 
BLBabel, BLLeo, CWI0000, CWI0001, CWI0002, CWI0003, CWI0004, EITCLEFFF, 
EITCLEFGG, EITCLEFII, EITCLEFM1, EITCLEFM2, EITCLEFM3, finstr, FrenchUCWLP, 
GermanUCWLP, gerstr, geruns, GIRTBabel, GIRTML, glalong, glatitle, iaiphsrun, 
irit1men2a, irit2men2a, iritmonofr, iritmonoge, iritmonoit, irst1, irst2, 
ItalianUCWLP, MLgerman, nmsuK, nmsuS, pruebaO, ralie2allh1, ralie2allh2, 
ralie2allwac, ralif2ewacf, ralif2f, ralif2ff, ralig2ewacf, ralig2gf, 
ralii2if, shefbi, shefes, shefni, sheftri, swestr, SYRD2E, tnoutex1, 
tnoutex2, tnoutex3, tnoutff2, tnoutii2, tnoutnel, unstemmed] 